We present RAG SOP Assistant, an AI-powered question-answering system that transforms static corporate Standard Operating Procedure (SOP) documents into a dynamic, conversational knowledge base. Built on a Retrieval-Augmented Generation (RAG) architecture, the system enables employees to ask natural language questions about company procedures and receive accurate, source-attributed answers grounded in official SOP documents. The pipeline processes PDF, DOCX, and TXT files through intelligent sentence-boundary-aware chunking (~500 characters with 100-character overlap), generates 384-dimensional semantic embeddings using the multilingual E5-Small model, and stores them in a ChromaDB vector database with cosine similarity search. At query time, the system retrieves the top-5 most relevant document chunks and passes them as context to DeepSeek-V3 for answer generation. The application features a three-tab Gradio interface for Q&A chat, document upload, and database management, with thread-safe operations, XSS prevention, input validation, and automatic indexing of default SOP documents at startup. We describe the architecture, chunking strategy, embedding pipeline, security model, and deployment options for this enterprise-ready knowledge management tool.
Standard Operating Procedures (SOPs) form the backbone of organizational knowledge management. These documents codify critical business processes—from emergency response protocols to staff training procedures, technology usage guidelines, and equipment maintenance workflows. However, in practice, SOP documents are frequently lengthy, distributed across multiple files, and written in formal procedural language that makes rapid information retrieval challenging for employees who need quick, actionable answers.
The traditional approach to SOP consultation involves manual document search: employees must identify the correct document, navigate to the relevant section, and interpret the procedural language in context. This process is time-consuming, error-prone, and particularly problematic during high-pressure situations where quick access to emergency procedures is critical. Keyword-based search tools offer incremental improvement but fail to capture semantic meaning—a search for "what to do during a fire" will not match a section titled "Emergency Evacuation Protocol."
RAG SOP Assistant addresses these limitations by combining semantic search with large language model (LLM) generation. The system enables employees to ask questions in natural, conversational language and receive contextually accurate answers with explicit source attribution to the originating SOP document. The key contributions of this work are a sentence-boundary-aware chunking strategy suited to procedural documents, a multilingual embedding and retrieval pipeline built on E5-Small and ChromaDB, context-grounded answer generation with explicit source attribution, and an enterprise-ready implementation covering thread safety, input sanitization, and deployment.
Retrieval-Augmented Generation (Lewis et al., 2020) introduced the paradigm of combining document retrieval with neural text generation, demonstrating significant improvements over purely parametric models on knowledge-intensive tasks. The RAG approach addresses the fundamental limitation of LLMs—that their knowledge is frozen at training time—by grounding generation in retrieved evidence from an external knowledge base.
Vector databases such as ChromaDB (Trychroma, 2023), Pinecone, and Qdrant have emerged as purpose-built storage systems for embedding-based retrieval, offering efficient approximate nearest neighbor (ANN) search over high-dimensional vector spaces. ChromaDB distinguishes itself through its lightweight, embeddable design that supports both in-memory and persistent storage modes, making it suitable for applications ranging from prototypes to production deployments.
Multilingual embedding models, particularly the E5 family (Wang et al., 2022), have demonstrated strong cross-lingual semantic similarity performance. The E5-Small variant used in this work provides 384-dimensional embeddings with support for 100+ languages, balancing retrieval quality with computational efficiency—a critical consideration for deployments on resource-constrained hardware such as Hugging Face Spaces.
Prior work on document Q&A systems has largely focused on general-purpose applications. RAG SOP Assistant specializes in the corporate SOP domain, where documents follow structured procedural formats and answers must be traceable to authoritative sources—requirements that demand careful attention to chunking strategy, source preservation, and answer grounding.
RAG SOP Assistant follows a modular pipeline architecture comprising four stages: document ingestion, text chunking and embedding, vector storage, and retrieval-augmented generation. Each stage is designed for reliability, thread safety, and enterprise deployment.
The system supports three document formats commonly used for SOPs in Indonesian corporate environments:
| Format | Extension | Extraction Engine | Notes |
|---|---|---|---|
| PDF | .pdf | PyMuPDF (fitz) | Page-by-page text extraction with layout preservation |
| Word | .docx | python-docx | Paragraph-level text extraction |
| Plain Text | .txt | UTF-8 read | Direct content ingestion |
Upon upload, files are validated against a size limit (50 MB maximum) and a format whitelist. The extraction process is wrapped in comprehensive error handling with sanitized error messages that never expose internal paths or API credentials.
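The validation step can be sketched as follows. This is a minimal illustration, not the application's actual code: the function name `validate_upload` and the exact error strings are assumptions, chosen only to show the whitelist-then-size-check order.

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB cap

def validate_upload(path: str) -> str:
    """Return the lowercased extension if the file passes validation."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        # Generic message: never echoes internal paths or credentials.
        raise ValueError("Unsupported file type")
    if os.path.getsize(path) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 50 MB limit")
    return ext
```

Checking the extension before touching the filesystem means an obviously invalid upload is rejected without any disk access.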
Text chunking is a critical determinant of RAG system quality. Chunks that are too large dilute the semantic signal; chunks that are too small lose contextual coherence. RAG SOP Assistant implements a sentence-boundary-aware chunking strategy with the following parameters:
| Parameter | Value | Rationale |
|---|---|---|
| Chunk Size | ~500 characters | Balances semantic density with contextual completeness |
| Chunk Overlap | 100 characters | Ensures continuity across chunk boundaries |
| Boundary Detection | Sentence-aware (. ! ? \n) | Preserves semantic coherence of procedural steps |
| Boundary Validation | >30% of chunk_size | Falls back to character-based split if no good boundary exists |
The chunking algorithm first attempts to split text at natural sentence boundaries (periods, exclamation marks, question marks, newlines). If no suitable boundary is found within the target window, it falls back to character-based splitting. This approach is particularly important for SOP documents, where procedural steps often span multiple sentences and should not be split mid-instruction.
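The strategy above can be sketched in a few lines of Python. This is a simplified reconstruction under the stated parameters (500-character target, 100-character overlap, 30% boundary-validation threshold); the real implementation's function names and exact edge-case handling may differ.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into ~chunk_size-character chunks, preferring sentence boundaries."""
    boundaries = ".!?\n"
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Find the last sentence boundary inside the target window.
            window = text[start:end]
            cut = max(window.rfind(c) for c in boundaries)
            # Accept it only if the chunk stays reasonably full (>30% of
            # chunk_size); otherwise fall back to a character-based split.
            if cut > int(chunk_size * 0.3):
                end = start + cut + 1
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return [c for c in chunks if c]
```

Because `start` advances to `end - overlap`, the tail of each chunk reappears at the head of the next, so a procedural step cut at a boundary is still fully present in at least one chunk.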
Each text chunk is converted to a 384-dimensional vector using the intfloat/multilingual-e5-small embedding model from the sentence-transformers library. The E5-Small model was selected for three reasons: multilingual coverage of 100+ languages (important for Indonesian-language SOP documents), compact 384-dimensional vectors that keep index size and search latency low, and strong retrieval quality relative to its modest ~470 MB footprint.
Embeddings are stored in a ChromaDB persistent collection configured with cosine similarity as the distance metric. Each chunk is stored with associated metadata:
```python
collection.add(
    ids=[md5_hash(f"{filename}_chunk_{index}")],  # deterministic MD5-based ID
    embeddings=[embedding_vector],
    documents=[chunk_text],
    metadatas=[{"source": filename, "chunk_index": index}],
)
```
The use of MD5-based chunk IDs enables idempotent insertion—re-uploading the same document does not create duplicate entries. The PersistentClient stores data at ./chroma_db, ensuring that the vector index survives application restarts.
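The idempotency property follows directly from deterministic hashing: the same filename and chunk index always produce the same ID, so a duplicate-checking insert cannot create a second copy. A minimal illustration follows; `chunk_id` and `add_if_new` are hypothetical helpers, and a plain `dict` stands in for the ChromaDB collection.

```python
import hashlib

def chunk_id(filename: str, index: int) -> str:
    """Deterministic chunk ID: same file + same index -> same ID."""
    return hashlib.md5(f"{filename}_chunk_{index}".encode("utf-8")).hexdigest()

def add_if_new(store: dict, cid: str, chunk: str) -> bool:
    """Insert only when the ID is unseen; return True if inserted."""
    if cid in store:
        return False
    store[cid] = chunk
    return True
```

Re-running ingestion over an already-indexed document regenerates identical IDs, so every `add_if_new` call is a no-op and the index size is unchanged.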
At query time, the user's question is embedded using the same E5-Small model and searched against the ChromaDB collection using cosine similarity. The top-5 most similar chunks are retrieved along with their source metadata. These chunks are assembled into a structured context block and passed to DeepSeek-V3 with the following LLM configuration:
| Parameter | Value | Rationale |
|---|---|---|
| Model | deepseek-chat (V3) | Cost-effective with strong reasoning capabilities |
| Temperature | 0.3 | Low temperature for factual, deterministic SOP answers |
| Max Tokens | 1,500 | Sufficient for detailed procedural explanations |
| System Prompt | Context-grounded | Constrains answers to provided SOP context only |
The system prompt explicitly instructs the LLM to answer only based on the provided context, preventing hallucination of procedures not present in the indexed SOPs. Every generated answer includes source attribution identifying the originating document.
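The prompt-assembly step can be sketched as below. The function name and the system-prompt wording are illustrative only; the application's exact prompt is not reproduced in this paper.

```python
def build_messages(question, retrieved):
    """Assemble retrieved chunks into a context-grounded chat payload.

    `retrieved` is a list of (chunk_text, source_filename) pairs from the
    top-5 similarity search.
    """
    context = "\n\n".join(f"[Source: {src}]\n{text}" for text, src in retrieved)
    system = (
        "You are an assistant for company SOPs. Answer ONLY from the context "
        "below. If the answer is not in the context, say you do not know. "
        "Always name the source document.\n\nCONTEXT:\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list is what an OpenAI-compatible client would pass as `messages`, alongside the `deepseek-chat` model name, temperature 0.3, and the 1,500-token cap from the configuration table.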
Enterprise SOP systems handle sensitive operational procedures that require appropriate security measures. RAG SOP Assistant implements multiple security layers:
All user-supplied text is passed through html.escape() before any processing or display, preventing cross-site scripting (XSS) attacks.
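A minimal sketch of this sanitization step; the helper name and the length cap are illustrative assumptions, not the application's actual values.

```python
import html

def sanitize_input(raw: str, max_len: int = 2000) -> str:
    """Trim, cap, and HTML-escape user text before it is processed or shown."""
    text = raw.strip()[:max_len]
    return html.escape(text)  # <, >, &, and quotes become entities
```

Because escaping happens before the text reaches the retrieval pipeline or the chat display, markup injected into a question is rendered inert everywhere downstream.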
The DeepSeek API key is read from the DEEPSEEK_API_KEY environment variable (supplied via Hugging Face Secrets in cloud deployments, or exported locally). The key is never logged, included in error messages, or exposed through the web interface. Error sanitization ensures that API-related failures return generic error messages to the user.
The application uses a global threading.Lock to protect ChromaDB collection initialization, ensuring safe concurrent access in multi-user environments. The get_collection() function implements lazy initialization with thread-safe locking, preventing race conditions during the first query after startup.
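The pattern described here is classic double-checked locking. The sketch below reconstructs it under stated assumptions: `init_collection` is a stub standing in for the expensive ChromaDB setup, and the exact names may differ from the real `get_collection()`.

```python
import threading

def init_collection():
    # Stand-in for the real ChromaDB setup; must run exactly once.
    return {"name": "sop_chunks"}

_collection = None
_lock = threading.Lock()

def get_collection():
    """Thread-safe lazy initialization (double-checked locking)."""
    global _collection
    if _collection is None:            # fast path once initialized
        with _lock:
            if _collection is None:    # re-check while holding the lock
                _collection = init_collection()
    return _collection
```

The outer check keeps the common case lock-free, while the inner re-check guarantees that two threads racing on the first query cannot both run the expensive initialization.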
| Layer | Technology | Purpose |
|---|---|---|
| Web Framework | Gradio 5.9.1 | Interactive web UI with custom CSS theming |
| Language Model | DeepSeek-V3 (via OpenAI-compatible API) | Answer generation from retrieved SOP context |
| Embeddings | intfloat/multilingual-e5-small | 384-dimensional multilingual semantic vectors |
| Vector Database | ChromaDB (persistent mode) | Cosine similarity search over document embeddings |
| PDF Extraction | PyMuPDF (fitz) | Page-level text extraction from PDF documents |
| Word Extraction | python-docx | Paragraph-level text extraction from DOCX files |
| API Client | OpenAI SDK | DeepSeek API communication (OpenAI-compatible) |
| Validation | Pydantic 2.10.6 | Input validation and type checking |
| Language | Python 3.10+ | Application logic and pipeline orchestration |
The Gradio-based interface is organized into three functional tabs: Q&A chat, document upload, and database management.
The interface features custom CSS with the Plus Jakarta Sans font, a teal/blue gradient header, rounded corners, hover effects, and responsive layout for mobile and desktop usage.
The system includes five default SOP documents covering company procedures, staff training, technology usage, storage and maintenance, and emergency protocols. These documents are automatically indexed when the application starts, providing immediate Q&A capability without requiring manual upload. The auto-indexing process checks for existing entries to avoid duplicate indexing on subsequent restarts.
The recommended deployment path provides zero-configuration cloud hosting:
1. Duplicate the Hugging Face Space `romizone/RAG-SOP`
2. Add `DEEPSEEK_API_KEY` to the Space Secrets

Recommended hardware: CPU Basic (2 vCPU, 16GB RAM). The E5-Small embedding model (~470 MB) loads at startup and requires approximately 30–60 seconds for initialization.
```bash
git clone https://github.com/romizone/RAGSOP.git
cd RAGSOP
pip install -r requirements.txt
export DEEPSEEK_API_KEY="your-api-key-here"
python app.py
```
The application launches at http://localhost:7860 with full functionality including auto-indexing of default SOP documents.
| Metric | Value |
|---|---|
| Startup Time (including model loading) | ~30–60 seconds |
| Default SOP Documents | 5 PDFs (~256 chunks) |
| Embedding Model Size | ~470 MB |
| Embedding Dimensions | 384 |
| Query Response Time | ~3–5 seconds |
| Top-K Retrieval | 5 chunks per query |
| Recommended Hardware | 2 vCPU, 16GB RAM |
RAG SOP Assistant is designed for enterprise environments where rapid access to procedural knowledge is critical.
```
RAGSOP/
├── app.py                                  # Main Gradio application (~30 KB)
├── requirements.txt                        # Python dependencies
├── README.md                               # Comprehensive documentation
├── .gitignore                              # Git ignore rules
└── SOP/                                    # Default SOP documents
    ├── Kumpulan_SOP_Perusahaan.pdf         # Company SOP Collection
    ├── Pelatihan staf_8.pdf                # Staff Training Procedures
    ├── Penggunaan teknologi_7.pdf          # Technology Usage Guidelines
    ├── Penyimpanan dan pemeliharaan_4.pdf  # Storage & Maintenance
    └── SOP darurat_5.pdf                   # Emergency Procedures
```
RAG SOP Assistant demonstrates that Retrieval-Augmented Generation provides an effective approach to transforming static corporate SOP documents into dynamic, conversational knowledge bases. The combination of multilingual semantic embeddings, sentence-boundary-aware chunking, and context-grounded LLM generation enables accurate, source-attributed answers that employees can verify against original documents.
The system's enterprise-ready implementation—with thread-safe operations, XSS prevention, input validation, and automatic document indexing—addresses the practical requirements of organizational deployment. The three-tab Gradio interface provides a complete workflow from document upload through database management to conversational Q&A, all accessible through a web browser without local installation.
Future directions include:
The complete source code is available at https://github.com/romizone/RAGSOP and a live demo is accessible at https://huggingface.co/spaces/romizone/RAG-SOP.