We present Open Chatbot, a privacy-first AI chatbot designed to process sensitive documents without ever transmitting original files to cloud AI providers. Built on a Privacy by Design architecture, the system extracts text locally on the user's server using specialized processing engines (pdftotext, Mammoth, SheetJS, Tesseract OCR) and transmits only plain-text JSON chunks, capped at 30KB per file, to AI inference endpoints. This three-stage pipeline (local extraction, text-only transmission, immediate cleanup) ensures that no binary files, document metadata, or structural information ever leaves the user's infrastructure. The system supports multiple AI providers (DeepSeek, OpenAI, Anthropic), handles diverse file formats including PDF, DOCX, XLSX, images with OCR, and 20+ source code formats, and features streaming responses, LaTeX rendering, syntax highlighting, and multi-session chat history. We describe the architecture, security model, and design decisions that enable privacy-preserving AI document interaction suitable for processing confidential financial, legal, and medical documents.
The rapid adoption of AI-powered chatbots such as ChatGPT, Claude, and Gemini has transformed how organizations interact with documents. Users routinely upload financial reports, legal contracts, medical records, and proprietary research to these services for summarization, analysis, and question-answering. However, this convenience creates a fundamental tension: the original files—often containing highly sensitive information—are transmitted in full to third-party servers, where they may be stored, logged with metadata, or potentially used for model training.
Consider the standard upload flow: when a user attaches a 5MB PDF to ChatGPT, the entire binary file is sent to OpenAI's servers. The document structure, embedded fonts, metadata (author, creation date, revision history), and complete content are all transmitted and stored alongside the user's account information. For organizations bound by data protection regulations (GDPR, HIPAA, banking secrecy laws), this presents unacceptable compliance risks.
Open Chatbot addresses this problem through a fundamental architectural principle: files never leave your infrastructure. Instead of transmitting documents to AI providers, the system processes files locally, extracts only plain text content, and sends compact text-only payloads to AI inference endpoints. The key contributions of this work are:
The challenge of privacy-preserving AI interaction has been approached from several directions. Retrieval-Augmented Generation (RAG) systems (Lewis et al., 2020) store document embeddings in vector databases and retrieve relevant chunks during inference, but still require document content to be processed and stored in some form. Fully local LLM solutions (e.g., llama.cpp, Ollama) eliminate cloud dependency entirely but sacrifice the quality advantages of large-scale commercial models.
Existing privacy-focused chatbot solutions typically fall into two categories: fully local systems that run smaller models on-device, or enterprise API platforms with contractual data protection guarantees. Open Chatbot occupies a unique middle ground: it leverages the superior capabilities of large commercial models (GPT-4o, Claude Sonnet 4.5) while ensuring that only extracted text—not original files—reaches these services. This approach provides state-of-the-art AI quality without compromising document confidentiality.
The Vercel AI SDK (Vercel, 2024) provides the streaming infrastructure that enables real-time response delivery. Combined with specialized extraction libraries (pdf-parse, Mammoth, SheetJS), the system achieves comprehensive format coverage without relying on cloud-based document processing services.
Open Chatbot follows a modular, three-stage architecture designed around the principle that document content should be minimally exposed during AI interaction. The system comprises a local document processing engine, a text-only API gateway, and a reactive client interface.
The core privacy guarantee is implemented through a strict three-stage pipeline:
Stage 1: Local Processing. Files uploaded by the user are received by the Next.js API route (/api/upload/route.ts) and processed entirely on the local server. Each file type is handled by a specialized extraction engine: pdftotext for PDF documents, Mammoth and word-extractor for DOCX/DOC files, SheetJS for spreadsheets (XLSX/XLS/CSV), and Tesseract OCR for scanned images (PNG, JPG, BMP, TIFF, WebP). The extraction output is pure text content with no binary data, embedded objects, or metadata preserved.
Stage 2: Text-Only Transmission. The extracted text is formatted as a JSON payload conforming to the AI provider's chat API format. Each file's content is capped at 30KB before transmission, so data exposure stays bounded even for large documents. The API request contains only {"role": "user", "content": "extracted text..."}: no file attachments, no binary encodings, no metadata.
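The cap and the text-only payload can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `capText` and `toUserMessage` are hypothetical, and it assumes the 30KB limit is measured in UTF-8 bytes, which the description above does not specify.

```typescript
// Assumption: the cap is counted in UTF-8 bytes; the text only says "30KB per file".
const MAX_BYTES_PER_FILE = 30 * 1024;

interface ChatMessage {
  role: "user";
  content: string;
}

/** Truncate text so its UTF-8 encoding fits in maxBytes without splitting a character. */
function capText(text: string, maxBytes: number = MAX_BYTES_PER_FILE): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let end = maxBytes;
  // bytes[end] is the first excluded byte. If it is a continuation byte (10xxxxxx),
  // cutting here would split a character, so step back to the character's start.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}

/** Build the text-only message for one file (hypothetical helper). */
function toUserMessage(extractedText: string): ChatMessage {
  // Only the role and the capped text are included: no file name, no metadata.
  return { role: "user", content: capText(extractedText) };
}
```

Truncating on a character boundary keeps the payload valid UTF-8, so a multi-byte character at the 30KB mark is dropped whole rather than sent as a broken sequence.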
Stage 3: Immediate Cleanup. After text extraction completes, the reference to the original file buffer is dropped in a JavaScript finally block, so the memory becomes eligible for garbage collection whether extraction succeeds or fails. No temporary files are written to disk, and no persistent storage of uploaded documents occurs on the server. The file exists only transiently in memory during the extraction process.
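A minimal sketch of this cleanup pattern is shown below. The `extractAndRelease` helper and the `upload` holder object are hypothetical names used only for illustration; the project's actual route handler may be structured differently.

```typescript
type Extractor = (data: Uint8Array) => Promise<string>;

/**
 * Run an extractor over an uploaded buffer and drop the reference afterwards,
 * whether extraction succeeds or throws (hypothetical helper).
 */
async function extractAndRelease(
  upload: { data: Uint8Array | null },
  extract: Extractor,
): Promise<string> {
  try {
    if (upload.data === null) throw new Error("upload already released");
    // `await` matters here: without it, `finally` would run before extraction finishes.
    return await extract(upload.data);
  } finally {
    // JavaScript has no explicit free; clearing the last reference only makes
    // the buffer eligible for garbage collection.
    upload.data = null;
  }
}
```

Because the reference is cleared in `finally`, a failed extraction leaves no more data behind than a successful one.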
The system supports three AI provider families through a unified streaming interface built on Vercel AI SDK 6's streamText function:
| Provider | Models | Max Tokens | Characteristics |
|---|---|---|---|
| DeepSeek | Chat, Reasoner | 8K–16K | Cost-effective, strong reasoning |
| OpenAI | GPT-4o, GPT-4o Mini, GPT-4.1 | 16K–32K | Highest versatility, broad capabilities |
| Anthropic | Claude Sonnet 4.5, Claude Haiku 4.5 | 8K–16K | Strong analysis, safety-focused |
Each provider requires its own API key, which is stored exclusively in the browser's localStorage and is never stored on the application server. The chat API route (/api/chat/route.ts) dynamically selects the appropriate SDK provider based on the user's configuration, so users can switch between models without code changes.
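The selection step might look roughly like the sketch below. The `MODELS` registry and `resolveModel` function are hypothetical, and the model names are the display names from the provider table rather than real API identifiers, which may differ.

```typescript
type ProviderId = "deepseek" | "openai" | "anthropic";

interface ModelSelection {
  provider: ProviderId;
  model: string; // display name as listed in the provider table
}

// Assumption: display names copied from the table; actual API model IDs are not shown there.
const MODELS: Record<ProviderId, readonly string[]> = {
  deepseek: ["Chat", "Reasoner"],
  openai: ["GPT-4o", "GPT-4o Mini", "GPT-4.1"],
  anthropic: ["Claude Sonnet 4.5", "Claude Haiku 4.5"],
};

/** Validate a user's provider/model choice before a provider SDK is instantiated. */
function resolveModel(selection: ModelSelection): ModelSelection {
  const models: readonly string[] | undefined = MODELS[selection.provider];
  if (models === undefined) {
    throw new Error(`Unknown provider: ${selection.provider}`);
  }
  if (!models.includes(selection.model)) {
    throw new Error(`Model "${selection.model}" is not offered by ${selection.provider}`);
  }
  return { provider: selection.provider, model: selection.model };
}
```

Keeping the registry as data means a new model can be added by editing one table rather than the route logic.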
The file processing subsystem (lib/file-processor.ts) implements format-specific extraction strategies:
| Format | Extensions | Extraction Engine | Notes |
|---|---|---|---|
| PDF | .pdf | pdftotext + Tesseract OCR | Text + OCR fallback for scanned pages (max 20 pages) |
| Word | .docx, .doc | Mammoth, word-extractor | DOCX via Mammoth, legacy DOC via word-extractor |
| Spreadsheet | .xlsx, .xls, .csv | SheetJS (xlsx) | Converted to structured text tables |
| Images | .png, .jpg, .bmp, .tiff, .webp | Tesseract OCR | Optical character recognition |
| Text & Code | .txt, .md, .json, .xml, .html, .py, .js, .ts, etc. | Direct UTF-8 read | 20+ source code formats supported |
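The format table above can be read as a lookup from file extension to extraction engine. The sketch below illustrates that dispatch under stated assumptions: the `Engine` labels and the `pickEngine` function are hypothetical, and only extensions named in the text are listed (the real processor supports more source code formats).

```typescript
type Engine =
  | "pdftotext+ocr"
  | "mammoth"
  | "word-extractor"
  | "sheetjs"
  | "tesseract"
  | "utf8";

// Extensions taken from the table; the list of text and code formats is not exhaustive.
const ENGINE_BY_EXTENSION: Record<string, Engine> = {
  ".pdf": "pdftotext+ocr",
  ".docx": "mammoth",
  ".doc": "word-extractor",
  ".xlsx": "sheetjs",
  ".xls": "sheetjs",
  ".csv": "sheetjs",
  ".png": "tesseract",
  ".jpg": "tesseract",
  ".bmp": "tesseract",
  ".tiff": "tesseract",
  ".webp": "tesseract",
  ".txt": "utf8",
  ".md": "utf8",
  ".json": "utf8",
  ".xml": "utf8",
  ".html": "utf8",
  ".py": "utf8",
  ".js": "utf8",
  ".ts": "utf8",
};

/** Return the engine for a file name, or undefined if the format is unsupported. */
function pickEngine(fileName: string): Engine | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".env"
  return ENGINE_BY_EXTENSION[fileName.slice(dot).toLowerCase()];
}
```

Returning `undefined` for unknown extensions lets the caller reject unsupported uploads explicitly instead of guessing a parser.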
Open Chatbot's security model is built on the principle of minimal data exposure. We formalize the security guarantees through a comparison with traditional AI document upload workflows:
| Aspect | Traditional AI Upload | Open Chatbot |
|---|---|---|
| Original file sent to cloud | Yes — full binary | No — never |
| Document metadata transmitted | Yes — author, dates, revisions | No — stripped during extraction |
| Binary data exposure | Complete file (MBs) | Zero — text only |
| Data size to cloud | Full file size | Max 30KB per file |
| Used for model training | Possibly (varies by provider) | No — API mode excludes training |
| Data retention control | Minimal — provider-dependent | Full — user-controlled |
| Regulatory compliance | Complex — DPA required | Simplified — no file transfer |
API keys for each provider are stored exclusively in the browser's localStorage. A key leaves the browser only as part of an authenticated request destined for the respective AI provider, and the application backend does not persist it. Because the server keeps no copy of users' API credentials, no centralized key database exists as an attack surface.
The Next.js API routes operate in a completely stateless manner. No session data, uploaded files, or chat histories are persisted on the server. Each request is processed independently, and all file buffers are released immediately after extraction. The server maintains zero knowledge of previous interactions, providing natural protection against data accumulation risks.
| Layer | Technology | Purpose |
|---|---|---|
| Framework | Next.js 16 (App Router, Turbopack) | Full-stack React framework with API routes |
| UI Library | React 19, Tailwind CSS 4, Radix UI | Responsive, accessible component library |
| AI Integration | Vercel AI SDK 6 (streamText) | Unified streaming interface for multi-provider |
| PDF Processing | pdftotext, Tesseract | Text extraction and OCR for scanned documents |
| Word Processing | Mammoth, word-extractor | DOCX/DOC text extraction |
| Spreadsheet Processing | SheetJS (xlsx) | Excel/CSV to structured text |
| Image OCR | Tesseract.js | Optical character recognition for images |
| Math Rendering | KaTeX, remark-math, rehype-katex | LaTeX formula rendering in responses |
| Code Display | react-syntax-highlighter (Prism) | Syntax-highlighted code blocks |
| Language | TypeScript 5 (strict mode) | Type-safe development |
The chat interface is designed as a responsive single-page application with several advanced features:
Response delivery uses Vercel AI SDK 6's streamText function, which establishes a Server-Sent Events (SSE) connection between the client and the Next.js API route. The API route, in turn, streams responses from the selected AI provider. This double-streaming architecture ensures that users see response tokens as they are generated, providing a responsive experience even for long AI outputs.
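// Simplified streaming flow (inside the /api/chat route handler)
import { streamText } from "ai";

const result = streamText({
  model: selectedProvider(modelId),        // provider instance chosen from the user's settings
  messages: conversationHistory,
  maxOutputTokens: modelConfig.maxTokens,  // this option was named maxTokens before AI SDK 5
});
// Stream the result back to the client over SSE
// (this method replaced toDataStreamResponse() in AI SDK 5)
return result.toUIMessageStreamResponse();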
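On the client side, incremental rendering comes down to reading the response body chunk by chunk. The sketch below shows the idea for a plain text stream using only web-standard APIs. It is an illustration under assumptions, not the project's client code: `readTextStream` is a hypothetical name, and an application built on the AI SDK would more likely rely on the SDK's own client hooks, which parse the SSE message protocol for it.

```typescript
/** Read a streamed body chunk by chunk, calling onToken as text arrives. */
async function readTextStream(
  body: ReadableStream<Uint8Array>,
  onToken: (text: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let full = "";
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true buffers incomplete multi-byte characters across chunk boundaries
      const text = decoder.decode(value, { stream: true });
      if (text.length > 0) {
        full += text;
        onToken(text);
      }
    }
    // Flush any bytes still buffered inside the decoder
    const tail = decoder.decode();
    if (tail.length > 0) {
      full += tail;
      onToken(tail);
    }
  } finally {
    reader.releaseLock();
  }
  return full;
}
```

A caller would pass `response.body` from `fetch` and append each token to the visible message in `onToken`.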
Open Chatbot supports three deployment configurations, each with different trade-offs between convenience and feature completeness:
Running the application directly on a host system is the recommended deployment for maximum privacy and full feature support, including OCR capabilities:
git clone https://github.com/romizone/chatbot-next.git
cd chatbot-next
npm install
npm run build
npm start
This configuration supports all file formats including OCR for scanned PDFs and images, as Tesseract and poppler-utils can be installed on the host system.
A Dockerfile is provided with all OCR dependencies (Tesseract, Poppler) pre-installed:
docker build -t open-chatbot .
docker run -p 3000:3000 open-chatbot
Vercel offers quick deployment without infrastructure management, but OCR features are limited in serverless environments due to binary dependency constraints:
npx vercel
The codebase follows Next.js 16 App Router conventions with a clear separation between API routes, UI components, state management, and processing logic:
src/
├── app/
│   ├── api/
│   │   ├── chat/route.ts         # Multi-provider streaming endpoint
│   │   └── upload/route.ts       # Local file processing endpoint
│   ├── layout.tsx                # Root layout with providers
│   └── page.tsx                  # Entry point
├── components/chat/
│   ├── chat-page.tsx             # Main chat container
│   ├── chat-area.tsx             # Message display with markdown
│   ├── chat-input.tsx            # Input with file attachment
│   ├── markdown-renderer.tsx     # KaTeX + syntax highlighting
│   ├── settings-dialog.tsx       # Provider/model configuration
│   └── sidebar.tsx               # Session management
├── hooks/
│   └── use-chat-store.ts         # Zustand/localStorage state
└── lib/
    ├── file-processor.ts         # Document extraction engine
    └── constants.ts              # Model configs, providers
Open Chatbot demonstrates that it is possible to leverage state-of-the-art commercial AI models for document analysis while maintaining strict privacy guarantees over sensitive files. The three-stage privacy pipeline—local extraction, text-only transmission, immediate cleanup—provides a practical and auditable approach to the fundamental tension between AI capability and data confidentiality.
The system's multi-provider architecture ensures that users are not locked into a single AI vendor, while client-side key management eliminates centralized credential exposure. The comprehensive file format support, combined with streaming responses, LaTeX rendering, and syntax highlighting, delivers a production-quality user experience comparable to commercial AI chatbots.
Future directions include:
The complete source code is available at https://github.com/romizone/chatbot-next under the MIT license.