AI PDF Tools: A Lightweight, Privacy-Focused Web Application for Local PDF Processing

Romi Nur Ismanto
Independent Software Engineering Research, Jakarta, Indonesia
rominur@gmail.com
February 2025

Abstract

We present AI PDF Tools, an open-source, privacy-focused web application that provides a comprehensive suite of PDF processing utilities operating entirely on local infrastructure. The system addresses growing concerns surrounding document privacy by eliminating reliance on cloud-based PDF services, ensuring that sensitive documents never leave the user's machine. Built on a Flask backend with a vanilla HTML/CSS/JavaScript frontend, AI PDF Tools delivers six core capabilities: PDF-to-Word conversion via pdf2docx, intelligent PDF compression with three configurable quality levels using pypdf, optical character recognition through Tesseract OCR supporting 20+ languages, multi-document merging, page-range splitting, and password-protected PDF unlocking. The application implements automatic temporary file cleanup, a RESTful API architecture with six dedicated endpoints, and supports containerized deployment through Docker. We describe the system architecture, implementation details, security considerations, and deployment strategies that enable a zero-cloud, production-ready PDF processing environment suitable for individuals and organizations handling confidential documents.

Keywords: PDF processing, document conversion, optical character recognition, privacy-preserving tools, web application, Flask, open-source, local processing, Tesseract OCR, Docker

1. Introduction

Portable Document Format (PDF) remains the dominant standard for document exchange across academic, legal, governmental, and corporate contexts. Users routinely need to convert, compress, merge, split, or extract text from PDF files. However, the vast majority of available tools for these tasks operate as cloud-based services, requiring users to upload potentially sensitive documents to third-party servers. This introduces significant privacy risks, particularly for documents containing personally identifiable information, legal contracts, medical records, financial statements, or proprietary business data.

Commercial cloud-based PDF services such as Adobe Acrobat Online, SmallPDF, and ILovePDF process millions of documents daily on remote infrastructure. While these services offer convenience, they require users to relinquish control over their data, accept terms of service that may permit data retention or analysis, and maintain persistent internet connectivity. For privacy-conscious users, regulated industries, and air-gapped environments, these requirements are untenable.

AI PDF Tools addresses these challenges by providing a complete, locally executed PDF processing toolkit that requires no cloud connectivity and transmits no data to external servers. The system combines mature open-source libraries into a unified, user-friendly web application that runs on the user's own hardware.

The key contributions of this work are:

2. Related Work

PDF processing tools span a wide spectrum from desktop applications to cloud services. Adobe Acrobat Pro DC provides comprehensive PDF editing capabilities but requires expensive licensing and increasingly relies on cloud features. Open-source alternatives such as LibreOffice and PDFtk offer command-line PDF manipulation but lack unified web interfaces and require manual operation for each task.

Web-based PDF services have proliferated in recent years. SmallPDF (Hurlimann, 2013) and ILovePDF provide browser-based PDF operations with intuitive interfaces, but all processing occurs on remote servers. These services typically impose file size limits, daily usage caps, and retain uploaded documents for varying periods. For organizations subject to data protection regulations such as GDPR, HIPAA, or SOX, transmitting documents to these services may constitute a compliance violation.

In the open-source ecosystem, several Python libraries provide individual PDF processing capabilities. The pypdf library (Fenniak, 2005; maintained by Stahl, 2022) offers pure-Python PDF manipulation including merging, splitting, and metadata extraction. The pdf2docx library (Wang, 2020) enables high-fidelity PDF-to-Word conversion by parsing PDF page layouts and reconstructing them in DOCX format. Tesseract OCR (Smith, 2007), originally developed at HP Labs and later sponsored by Google, remains one of the most widely deployed open-source OCR engines, with support for over 100 languages.

AI PDF Tools distinguishes itself by integrating these individual capabilities into a unified, privacy-focused web application with a modern user interface, automatic resource management, and multiple deployment options—filling a gap between fragmented command-line tools and privacy-compromising cloud services.

3. System Architecture

AI PDF Tools follows a client-server architecture organized into three primary layers: a presentation layer (frontend), an application layer (Flask backend), and a processing layer (PDF library integrations). The system is designed for single-machine deployment where all layers execute on the same host, eliminating network-based data exposure.

3.1 Processing Pipeline

The application employs a RESTful API design pattern where the frontend communicates with the backend through six dedicated HTTP endpoints, each corresponding to a core PDF operation. All file transfers occur over the local loopback interface, and processed files are stored temporarily on the local filesystem before being served back to the client for download.

User Upload → Flask API Endpoint → Library Processing → Temp File Storage → Client Download → Scheduled Cleanup

3.2 Backend Architecture

The Flask backend serves as the central orchestration layer, handling HTTP request routing, file upload validation, library invocation, and response formatting. File uploads are validated against a configurable maximum size limit of 50 MB, enforced through Flask's MAX_CONTENT_LENGTH configuration parameter. Temporary files are stored in a dedicated temp/ directory, with a background cleanup thread running every 10 minutes to remove files older than 1 hour.
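The validation and temporary-storage steps described above could be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the helper names looks_like_pdf and temp_path_for are hypothetical, and the assumption that validation checks the file extension and the PDF signature is ours rather than the source's.

```python
import os
import uuid

TEMP_DIR = "temp"  # the temp/ directory described above


def looks_like_pdf(filename: str, head: bytes) -> bool:
    """Cheap pre-check: a .pdf extension and the %PDF- signature near the start."""
    return filename.lower().endswith(".pdf") and b"%PDF-" in head[:1024]


def temp_path_for(suffix: str, temp_dir: str = TEMP_DIR) -> str:
    """Return a collision-resistant path inside temp_dir.

    The name is random, so client-supplied file names never reach the filesystem.
    """
    os.makedirs(temp_dir, exist_ok=True)
    return os.path.join(temp_dir, f"{uuid.uuid4().hex}{suffix}")
```

Generating random names rather than reusing uploaded file names sidesteps path traversal and collisions between concurrent requests, which matters because every endpoint writes into the same directory.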

3.3 Frontend Architecture

The frontend is implemented entirely in vanilla HTML, CSS, and JavaScript without external framework dependencies, minimizing the attack surface and eliminating supply-chain risks from third-party npm packages. The interface employs a tab-based navigation pattern where each PDF tool occupies its own tab panel. File uploads are handled through both traditional file input elements and a drag-and-drop interface.

3.4 API Design

Table 1: REST API endpoint specifications
Endpoint | Method | Function | Input | Output
/convert | POST | PDF to Word | PDF file | DOCX file
/compress | POST | Compress PDF | PDF file + quality level | Compressed PDF
/ocr | POST | OCR PDF | PDF file + language | Searchable PDF
/merge | POST | Merge PDFs | Multiple PDF files | Merged PDF
/split | POST | Split PDF | PDF file + page range | Split PDF
/unlock | POST | Unlock PDF | PDF file + password | Unlocked PDF
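
To illustrate how a script might call one of these endpoints, the sketch below posts a PDF to /compress using only the standard library. The form field names ("file", "quality"), the accepted quality values, and the base URL on port 5000 are assumptions for illustration; the paper does not specify them.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode text fields plus one PDF as multipart/form-data.

    Returns a (body, content_type) pair suitable for an HTTP POST.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def compress_via_api(pdf_bytes: bytes, quality: str = "medium",
                     base_url: str = "http://127.0.0.1:5000") -> bytes:
    """POST a PDF to /compress and return the response body.

    Field names and the response format are assumed, not documented.
    """
    body, content_type = encode_multipart({"quality": quality}, "file", "input.pdf", pdf_bytes)
    request = urllib.request.Request(
        f"{base_url}/compress",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Because the request targets 127.0.0.1, the document stays on the loopback interface, consistent with the local-only design.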

4. Implementation Details

4.1 Technology Stack

Table 2: Core technology stack and dependencies
Component | Technology | Purpose
Backend Framework | Flask (Python 3.10+) | Web server, API routing, file handling
PDF to Word | pdf2docx | Layout-preserving PDF to DOCX conversion
PDF Manipulation | pypdf | Compression, merging, splitting, unlocking
OCR Engine | Tesseract OCR | Optical character recognition (20+ languages)
Image Processing | Pillow (PIL) | Image conversion for OCR pipeline
PDF Rendering | poppler-utils | PDF page to image rasterization
Frontend | Vanilla HTML/CSS/JS | Zero-dependency web interface
Containerization | Docker | Portable deployment environment

4.2 PDF-to-Word Conversion

The PDF-to-Word conversion module leverages the pdf2docx library, which parses the internal structure of PDF files—including text blocks, images, tables, and vector graphics—and reconstructs them as equivalent DOCX elements. Unlike simpler approaches that extract raw text and reflow it into a new document, pdf2docx preserves the spatial layout, font styling, table structures, and embedded images of the original PDF. The conversion process operates page by page, mapping PDF coordinate-space elements to Word document paragraphs, runs, and table cells.

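from pdf2docx import Converter

def convert_pdf_to_word(pdf_path, docx_path):
    """Convert a PDF file to DOCX, preserving layout where possible."""
    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path)  # converts all pages by default
    finally:
        cv.close()  # release the file handle even if conversion fails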

4.3 PDF Compression

The compression module implements three quality tiers using pypdf's content stream optimization capabilities. Each tier applies increasingly aggressive compression strategies to reduce file size while balancing output quality:

Table 3: Compression quality levels and strategies
Level | Strategy | Typical Reduction | Use Case
Low Compression | Lossless stream optimization | 10–30% | Archival, high-quality retention
Medium Compression | Image resampling + stream optimization | 30–60% | Email attachments, general sharing
High Compression | Aggressive image downsampling + object removal | 50–80% | Web publishing, storage optimization

The compression engine reads each page of the input PDF, applies the selected optimization strategy, calling pypdf's compress_content_streams method on each page, and writes the optimized output. Image-heavy documents benefit most from compression, while text-only PDFs see modest but consistent size reductions from lossless content stream compression.
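
A tiered compressor along these lines might be sketched as follows, assuming a recent pypdf release (with Pillow installed for image re-encoding). The tier names mirror Table 3, but the JPEG quality values and the helper names are hypothetical, and the sketch covers only stream compression and image re-encoding, not the downsampling or object removal listed for the higher tiers.

```python
# Hypothetical tier settings: the JPEG quality numbers are illustrative only.
TIER_SETTINGS = {
    "low": {"jpeg_quality": None},   # lossless: recompress content streams only
    "medium": {"jpeg_quality": 70},
    "high": {"jpeg_quality": 40},
}


def settings_for(level: str) -> dict:
    """Look up tier settings, rejecting unknown levels early."""
    try:
        return TIER_SETTINGS[level.lower()]
    except KeyError:
        raise ValueError(f"unknown compression level: {level!r}") from None


def compress_pdf(input_path: str, output_path: str, level: str = "medium") -> None:
    """Rewrite a PDF with compressed content streams and optionally re-encoded images."""
    from pypdf import PdfWriter  # imported lazily so the helpers above work without pypdf

    quality = settings_for(level)["jpeg_quality"]
    writer = PdfWriter(clone_from=input_path)
    for page in writer.pages:
        if quality is not None:
            for image in page.images:
                image.replace(image.image, quality=quality)  # lossy re-encode
        page.compress_content_streams()  # lossless, but CPU intensive
    writer.write(output_path)
```

Keeping the tier table separate from the pypdf calls makes it straightforward to validate the user's choice before any file is opened.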

4.4 Optical Character Recognition

The OCR pipeline converts scanned or image-based PDF pages into searchable, selectable text through a three-stage process:

PDF Page Rasterization (poppler-utils) → Image Preprocessing (Pillow) → Text Extraction (Tesseract OCR)

Stage 1 — Rasterization: Each PDF page is rendered to a high-resolution raster image using poppler-utils' pdftoppm command. The rendering resolution is set to 300 DPI by default, balancing OCR accuracy with processing time.

Stage 2 — Preprocessing: Rendered images are preprocessed using Pillow for optimal OCR performance. Preprocessing steps include grayscale conversion, contrast enhancement, and noise reduction, improving recognition accuracy particularly for scanned documents with degraded image quality.

Stage 3 — Recognition: Tesseract OCR processes each preprocessed page image, extracting text with positional metadata. The system supports over 20 languages including English, Indonesian, Chinese (Simplified and Traditional), Japanese, Korean, Arabic, Hindi, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Thai, Vietnamese, Turkish, Polish, Swedish, and Danish. Users select the target language through the web interface, enabling Tesseract's language-specific recognition models.
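
The three stages can be sketched as small, composable helpers. This is one possible arrangement, assuming the pdftoppm and tesseract command-line tools are on the PATH; whether the project invokes Tesseract through its CLI or a Python wrapper is not stated, and the function names and the choice of a median filter for noise reduction are illustrative.

```python
import subprocess

# A few of the language codes used by Tesseract's traineddata files.
TESSERACT_LANGS = {
    "English": "eng",
    "Indonesian": "ind",
    "Chinese (Simplified)": "chi_sim",
    "Japanese": "jpn",
    "German": "deu",
}


def rasterize_command(pdf_path: str, out_prefix: str, dpi: int = 300) -> list:
    """Stage 1: pdftoppm renders each page to a PNG named from out_prefix."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def preprocess(image_path: str, out_path: str) -> None:
    """Stage 2: grayscale, contrast stretch, and median-filter denoising."""
    from PIL import Image, ImageFilter, ImageOps  # lazy import; requires Pillow

    with Image.open(image_path) as img:
        gray = ImageOps.grayscale(img)
        enhanced = ImageOps.autocontrast(gray)
        enhanced.filter(ImageFilter.MedianFilter(size=3)).save(out_path)


def ocr_command(image_path: str, out_base: str, language: str = "English") -> list:
    """Stage 3: the trailing 'pdf' config asks tesseract for out_base.pdf with a text layer."""
    return ["tesseract", image_path, out_base, "-l", TESSERACT_LANGS[language], "pdf"]


def run(command: list) -> None:
    """Execute one stage, raising CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

Building each command as an argument list, rather than a shell string, avoids quoting problems with file names that contain spaces.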

4.5 PDF Merging and Splitting

The merge operation combines multiple PDF files into a single output document while preserving the page ordering, bookmarks, and metadata of each input file. The implementation uses pypdf's PdfMerger class, which handles cross-reference table reconciliation and resource dictionary merging to produce a valid combined PDF. The split operation extracts a specified range of pages from a PDF document into a new file using pypdf's PdfReader and PdfWriter classes, preserving all page content, annotations, and formatting within the extracted range.
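
As a rough sketch of both operations, the code below parses a page range and then merges or splits with pypdf. It uses PdfWriter.append for merging rather than the PdfMerger class named above, because PdfMerger is deprecated in recent pypdf releases. The range syntax ("3-5" or "7", 1-based and inclusive) and all function names are assumptions for illustration.

```python
def parse_page_range(spec: str, page_count: int) -> list:
    """Turn a 1-based inclusive range such as '3-5' (or a single page '7') into 0-based indices."""
    first, _, last = spec.strip().partition("-")
    start = int(first)
    end = int(last) if last else start
    if not 1 <= start <= end <= page_count:
        raise ValueError(f"page range {spec!r} is outside 1-{page_count}")
    return list(range(start - 1, end))


def merge_pdfs(input_paths: list, output_path: str) -> None:
    """Concatenate the inputs in the order given."""
    from pypdf import PdfWriter  # lazy import so parse_page_range works without pypdf

    writer = PdfWriter()
    for path in input_paths:
        writer.append(path)
    writer.write(output_path)


def split_pdf(input_path: str, output_path: str, spec: str) -> None:
    """Copy the pages selected by spec into a new file."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    writer = PdfWriter()
    for index in parse_page_range(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    writer.write(output_path)
```

Validating the range against the real page count before copying any pages lets the endpoint return a clear error instead of an empty or truncated file.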

4.6 PDF Unlocking

The unlock module removes password protection from encrypted PDF files. Users provide the document password through the web interface, and the system uses pypdf's decryption capabilities to authenticate and produce an unprotected copy. This feature is designed for legitimate use cases where users need to remove password protection from their own documents for workflow integration or archival purposes.

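from pypdf import PdfReader, PdfWriter

def unlock_pdf(input_path, output_path, password):
    """Write an unencrypted copy of a password-protected PDF."""
    reader = PdfReader(input_path)
    # decrypt() returns a falsy value when the password is wrong
    if reader.is_encrypted and not reader.decrypt(password):
        raise ValueError("Incorrect password")
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.write(output_path)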

5. Core Features and User Interface

5.1 Tab-Based Navigation

The web interface organizes the six PDF tools into a horizontal tab bar, allowing users to switch between operations without page reloads. Each tab contains a self-contained tool panel with its own file upload area, configuration options, and download controls. This design pattern reduces cognitive overhead by presenting only the relevant controls for the selected operation while maintaining instant access to all tools.

5.2 Drag-and-Drop Upload

All six tools support drag-and-drop file upload in addition to traditional file browser dialogs. The drag-and-drop zones provide visual feedback during hover states, confirming valid drop targets and accepted file types. This interaction pattern streamlines the workflow for users processing multiple documents in succession.

5.3 Feature Summary

Table 4: Complete feature matrix of AI PDF Tools
Feature | Library | Input | Output | Configuration
PDF to Word | pdf2docx | Single PDF | DOCX | None (automatic)
Compress PDF | pypdf | Single PDF | Compressed PDF | 3 quality levels
OCR PDF | Tesseract OCR | Scanned PDF | Searchable PDF | Language selection (20+)
Merge PDF | pypdf | Multiple PDFs | Merged PDF | File ordering
Split PDF | pypdf | Single PDF | Extracted pages | Page range specification
Unlock PDF | pypdf | Encrypted PDF | Unlocked PDF | Password input

6. Security and Privacy Considerations

AI PDF Tools was designed with privacy as a foundational architectural requirement rather than an afterthought. The system implements multiple layers of protection to keep documents on the user's machine and to limit how long they persist there.

6.1 Zero Data Transmission

All PDF processing occurs exclusively on the local machine. The Flask server binds to the loopback interface by default, and no outbound network connections are established during document processing. The application contains no analytics tracking, telemetry collection, or crash reporting mechanisms that could inadvertently transmit document data or metadata to external services.

6.2 Automatic File Cleanup

Temporary files generated during PDF processing represent a potential data exposure vector if left on disk indefinitely. AI PDF Tools implements an automatic cleanup mechanism using a background scheduler thread that executes every 10 minutes. The cleanup routine identifies and removes all temporary files older than 1 hour from the temp/ directory, ensuring that processed documents do not persist beyond their useful lifecycle.

# Automatic temp file cleanup configuration
# Runs every 10 minutes, removes files older than 1 hour
CLEANUP_INTERVAL = 600     # 10 minutes in seconds
MAX_FILE_AGE = 3600        # 1 hour in seconds
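
A cleanup routine driven by these two settings could look like the sketch below. It uses only the standard library; the function names and the use of a daemon thread with an Event for shutdown are illustrative choices, not necessarily how the project implements its scheduler.

```python
import os
import threading
import time

CLEANUP_INTERVAL = 600   # seconds between sweeps (10 minutes)
MAX_FILE_AGE = 3600      # seconds a temp file may live (1 hour)


def remove_stale_files(directory: str, max_age: float = MAX_FILE_AGE, now: float = None) -> int:
    """Delete regular files whose modification time is older than max_age; return the count."""
    now = time.time() if now is None else now
    removed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                if entry.is_file() and now - entry.stat().st_mtime > max_age:
                    os.remove(entry.path)
                    removed += 1
            except OSError:
                continue  # file vanished or is locked; retry on the next sweep
    return removed


def start_cleanup_thread(directory: str, interval: float = CLEANUP_INTERVAL) -> threading.Event:
    """Sweep the directory every interval seconds on a daemon thread.

    Setting the returned event stops the loop.
    """
    stop = threading.Event()

    def loop():
        # wait() returns False on timeout (run a sweep) and True once stop is set (exit)
        while not stop.wait(interval):
            remove_stale_files(directory)

    threading.Thread(target=loop, daemon=True, name="temp-cleanup").start()
    return stop
```

With these values a file can survive for up to roughly 70 minutes in the worst case, since a file that turns one hour old just after a sweep is not removed until the next one.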

6.3 Upload Size Enforcement

The application enforces a configurable maximum upload size of 50 MB per file through Flask's built-in content length validation. This prevents resource exhaustion attacks and ensures predictable memory consumption during processing. The limit can be adjusted through environment variables for deployments requiring larger documents.
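
One way to wire an environment variable to Flask's limit is sketched below. The variable name MAX_UPLOAD_MB and both helper functions are hypothetical; the paper states only that the limit is adjustable through environment variables.

```python
import os


def max_upload_bytes(env=None, default_mb: int = 50) -> int:
    """Read the limit in megabytes from MAX_UPLOAD_MB (hypothetical name); default to 50 MB."""
    env = os.environ if env is None else env
    megabytes = int(env.get("MAX_UPLOAD_MB", default_mb))
    if megabytes <= 0:
        raise ValueError("MAX_UPLOAD_MB must be positive")
    return megabytes * 1024 * 1024


def apply_upload_limit(app, env=None) -> None:
    """Set MAX_CONTENT_LENGTH on a Flask app; Flask rejects larger request bodies with HTTP 413."""
    app.config["MAX_CONTENT_LENGTH"] = max_upload_bytes(env)
```

Because Flask checks the declared content length before the body is read, oversized uploads are rejected without being buffered in full.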

6.4 Open-Source Auditability

The complete source code is published under the MIT License, enabling independent security auditing by any interested party. All dependencies are established, widely used open-source projects with active maintenance communities, reducing the risk of supply-chain vulnerabilities.

7. Performance Characteristics

AI PDF Tools performance varies across operations depending on document complexity, page count, and the specific processing task. The following benchmarks characterize expected processing times for representative document types.

Table 5: Processing time benchmarks by operation type
Operation | Document Type | Pages | File Size | Processing Time
PDF to Word | Text-heavy report | 20 | 2.1 MB | 3–8 seconds
Compress (Medium) | Image-heavy presentation | 30 | 15 MB | 2–5 seconds
OCR | Scanned document | 10 | 8 MB | 15–45 seconds
Merge | Mixed documents | 50 (total) | 25 MB (total) | 1–3 seconds
Split | Large report | 100 | 10 MB | <1 second
Unlock | Encrypted document | 15 | 3 MB | <1 second

OCR operations are the most computationally intensive due to the rasterization and character recognition stages. Processing time scales roughly linearly with page count and is influenced by image complexity and the selected language model. Compression, merging, splitting, and unlocking operations complete in near-real-time for typical document sizes, as these operations involve primarily in-memory stream manipulation rather than pixel-level processing.
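
Readers who want to reproduce such measurements on their own hardware could use a small timing harness like the one below. It is a generic sketch, not the procedure used to produce Table 5; both function names are illustrative.

```python
import statistics
import time


def time_operation(func, *args, repeats: int = 3, **kwargs) -> dict:
    """Call func repeatedly and report min/median/max wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        started = time.perf_counter()
        func(*args, **kwargs)
        samples.append(time.perf_counter() - started)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def seconds_per_page(total_seconds: float, pages: int) -> float:
    """Normalize a measurement so documents of different lengths can be compared."""
    return total_seconds / pages
```

Applied to the OCR row of Table 5, 15 to 45 seconds for 10 pages works out to 1.5 to 4.5 seconds per page; if the per-page cost holds, a 100-page scan would take on the order of 150 to 450 seconds.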

8. Deployment Strategies

AI PDF Tools supports three deployment modes to accommodate different infrastructure requirements and operational contexts.

8.1 Native Python Deployment

The simplest deployment method runs the application directly using a Python virtual environment. This approach is suitable for individual users and small teams with Python 3.10+ installed. System-level dependencies including Tesseract OCR and poppler-utils must be installed separately through the operating system's package manager:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python app.py

8.2 Docker Containerized Deployment

For reproducible, isolated deployments, AI PDF Tools provides a Docker configuration that packages the application with all system-level dependencies. The Docker image is based on a Python 3.10 slim base image with Tesseract OCR and poppler-utils pre-installed, eliminating manual dependency management and providing filesystem isolation:

docker build -t pdf-tools .
docker run -p 5000:5000 pdf-tools

8.3 Cloud Platform Deployment

For teams requiring network-accessible deployment, AI PDF Tools includes a Procfile for deployment to platforms supporting the Buildpack specification (Heroku, Railway, Render). The Procfile configures Gunicorn as the production WSGI server with dynamic port binding through the $PORT environment variable:

web: gunicorn app:app --bind 0.0.0.0:$PORT

While cloud deployment inherently involves transmitting documents over the network, this approach remains preferable to third-party PDF services because the user controls the server infrastructure and the open-source codebase provides full transparency into data handling.

9. Discussion

9.1 Design Trade-offs

Several deliberate design trade-offs shaped the architecture of AI PDF Tools. The decision to use vanilla HTML/CSS/JavaScript for the frontend eliminates build toolchain complexity and third-party dependency risks but limits the sophistication of interactive UI elements compared to frameworks such as React or Vue.js. For a utility-focused application where users interact briefly to process documents, this trade-off favors simplicity and reliability over rich interactivity.

The choice of pypdf over alternative libraries such as PyMuPDF (fitz) reflects a preference for pure-Python implementation and permissive licensing. While PyMuPDF offers faster rendering and broader format support, its AGPL license introduces compliance considerations for commercial deployments. pypdf's BSD license aligns with AI PDF Tools' MIT licensing and minimizes legal friction for downstream users.

9.2 Limitations

The current system has several known limitations. PDF-to-Word conversion accuracy depends on the structural complexity of the source PDF; documents with intricate layouts, nested tables, or custom fonts may experience fidelity loss during conversion. OCR accuracy is constrained by the quality of the input scan and the selected Tesseract language model, with handwritten text and decorative fonts presenting particular challenges. The 50 MB file size limit, while configurable, may be insufficient for large document archives or high-resolution scanned volumes.

9.3 Comparison with Existing Solutions

Table 6: Comparison with existing PDF processing solutions
Feature | AI PDF Tools | SmallPDF | Adobe Acrobat | PDFtk (CLI)
Local Processing | Yes (fully local) | No (cloud) | Partial | Yes
Web Interface | Yes | Yes | Yes | No
OCR Support | Yes (20+ languages) | Yes | Yes | No
Cost | Free (MIT) | Freemium | Subscription | Free
Open Source | Yes | No | No | Yes (GPL)
Docker Support | Yes | N/A | No | No
Privacy Guarantee | Complete | Limited | Partial | Complete

10. Conclusion and Future Work

AI PDF Tools demonstrates that a comprehensive, production-quality PDF processing toolkit can be delivered as a lightweight, privacy-preserving web application operating entirely on local infrastructure. By integrating established open-source libraries—pdf2docx, pypdf, Tesseract OCR, Pillow, and poppler-utils—into a unified Flask-based application with a modern web interface, the system eliminates the need for cloud-based PDF services and the privacy compromises they entail.

The system's architecture prioritizes simplicity, auditability, and deployment flexibility, making it suitable for individual users, small teams, and organizations with strict data handling requirements. The automatic temporary file cleanup, configurable upload limits, and Docker containerization support enable reliable operation in production environments without ongoing maintenance overhead.

Future development directions include:

The complete source code is available at https://github.com/romizone/PDFtoword under the MIT License.

References

  1. Fenniak, M. (2005). pyPdf: A Pure-Python PDF Library. Subsequently maintained as pypdf by Stahl, M. et al. (2022). GitHub Repository. https://github.com/py-pdf/pypdf
  2. Wang, D. (2020). pdf2docx: Parse PDF to DOCX with layout preservation. GitHub Repository. https://github.com/dothinking/pdf2docx
  3. Smith, R. (2007). An Overview of the Tesseract OCR Engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), pp. 629-633.
  4. Clark, A. (2015). Pillow: The Friendly PIL Fork. Python Package Index. https://python-pillow.org/
  5. Freedesktop.org (2005). Poppler: A PDF Rendering Library. Open Source Project. https://poppler.freedesktop.org/
  6. Ronacher, A. (2010). Flask: A Lightweight WSGI Web Application Framework. GitHub Repository. https://github.com/pallets/flask
  7. Merkel, D. (2014). Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal, 2014(239), 2.
  8. Adobe Systems (1993). PDF Reference: Adobe Portable Document Format. Adobe Systems Incorporated. Version 1.0–2.0.
  9. European Parliament and Council (2016). General Data Protection Regulation (GDPR). Regulation (EU) 2016/679.
  10. ISO (2005). Document Management—Electronic Document File Format for Long-Term Preservation (PDF/A). ISO 19005-1:2005.