We present RAG SOP Assistant, an AI-powered question-answering system that transforms static corporate Standard Operating Procedure (SOP) documents into a dynamic, conversational knowledge base. Built on a Retrieval-Augmented Generation (RAG) architecture, the system enables employees to ask natural language questions about company procedures and receive accurate, source-attributed answers grounded in official SOP documents. The pipeline processes PDF, DOCX, and TXT files through intelligent sentence-boundary-aware chunking (~500 characters with 100-character overlap), generates 384-dimensional semantic embeddings using the multilingual E5-Small model, and stores them in a ChromaDB vector database with cosine similarity search. At query time, the system retrieves the top-5 most relevant document chunks and passes them as context to DeepSeek-V3 for answer generation. The application features a three-tab Gradio interface for Q&A chat, document upload, and database management, with thread-safe operations, XSS prevention, input validation, and automatic indexing of default SOP documents at startup. We describe the architecture, chunking strategy, embedding pipeline, security model, and deployment options for this enterprise-ready knowledge management tool.
Standard Operating Procedures (SOPs) form the backbone of organizational knowledge management. These documents codify critical business processes—from emergency response protocols to staff training procedures, technology usage guidelines, and equipment maintenance workflows. However, in practice, SOP documents are frequently lengthy, distributed across multiple files, and written in formal procedural language that makes rapid information retrieval challenging for employees who need quick, actionable answers.
The traditional approach to SOP consultation involves manual document search: employees must identify the correct document, navigate to the relevant section, and interpret the procedural language in context. This process is time-consuming, error-prone, and particularly problematic during high-pressure situations where quick access to emergency procedures is critical. Keyword-based search tools offer incremental improvement but fail to capture semantic meaning—a search for "what to do during a fire" will not match a section titled "Emergency Evacuation Protocol."
RAG SOP Assistant addresses these limitations by combining semantic search with large language model (LLM) generation. The system enables employees to ask questions in natural, conversational language and receive contextually accurate answers with explicit source attribution to the originating SOP document. The key contributions of this work are a sentence-boundary-aware chunking strategy suited to procedural documents, a multilingual embedding and retrieval pipeline built on E5-Small and ChromaDB, context-grounded answer generation with explicit source attribution, and an enterprise-ready implementation covering thread safety, input sanitization, and deployment.
Retrieval-Augmented Generation (Lewis et al., 2020) introduced the paradigm of combining document retrieval with neural text generation, demonstrating significant improvements over purely parametric models on knowledge-intensive tasks. The RAG approach addresses the fundamental limitation of LLMs—that their knowledge is frozen at training time—by grounding generation in retrieved evidence from an external knowledge base.
Vector databases such as ChromaDB (Trychroma, 2023), Pinecone, and Qdrant have emerged as purpose-built storage systems for embedding-based retrieval, offering efficient approximate nearest neighbor (ANN) search over high-dimensional vector spaces. ChromaDB distinguishes itself through its lightweight, embeddable design that supports both in-memory and persistent storage modes, making it suitable for applications ranging from prototypes to production deployments.
Multilingual embedding models, particularly the E5 family (Wang et al., 2022), have demonstrated strong cross-lingual semantic similarity performance. The E5-Small variant used in this work provides 384-dimensional embeddings with support for 100+ languages, balancing retrieval quality with computational efficiency—a critical consideration for deployments on resource-constrained hardware such as Hugging Face Spaces.
Prior work on document Q&A systems has largely focused on general-purpose applications. RAG SOP Assistant specializes in the corporate SOP domain, where documents follow structured procedural formats and answers must be traceable to authoritative sources—requirements that demand careful attention to chunking strategy, source preservation, and answer grounding.
RAG SOP Assistant follows a modular pipeline architecture comprising four stages: document ingestion, text chunking and embedding, vector storage, and retrieval-augmented generation. Each stage is designed for reliability, thread safety, and enterprise deployment.
The system supports three document formats commonly used for SOPs in Indonesian corporate environments:
| Format | Extension | Extraction Engine | Notes |
|---|---|---|---|
| PDF | .pdf | PyMuPDF (fitz) | Page-by-page text extraction with layout preservation |
| Word | .docx | python-docx | Paragraph-level text extraction |
| Plain Text | .txt | UTF-8 read | Direct content ingestion |
Upon upload, files are validated against a size limit (50 MB maximum) and a format whitelist. The extraction process is wrapped in comprehensive error handling with sanitized error messages that never expose internal paths or API credentials.
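The validation step can be sketched as follows. This is a minimal illustration, not the application's actual code: the function name `validate_upload` and the exact error strings are assumptions, chosen only to show the whitelist-then-size-check order.

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB cap

def validate_upload(path: str) -> str:
    """Return the lowercased extension if the file passes validation."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        # Generic message: never echoes internal paths or credentials.
        raise ValueError("Unsupported file type")
    if os.path.getsize(path) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 50 MB limit")
    return ext
```

Checking the extension before touching the filesystem means an obviously invalid upload is rejected without any disk access.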
Text chunking is a critical determinant of RAG system quality. Chunks that are too large dilute the semantic signal; chunks that are too small lose contextual coherence. RAG SOP Assistant implements a sentence-boundary-aware chunking strategy with the following parameters:
| Parameter | Value | Rationale |
|---|---|---|
| Chunk Size | ~500 characters | Balances semantic density with contextual completeness |
| Chunk Overlap | 100 characters | Ensures continuity across chunk boundaries |
| Boundary Detection | Sentence-aware (. ! ? \n) | Preserves semantic coherence of procedural steps |
| Boundary Validation | >30% of chunk_size | Falls back to character-based split if no good boundary exists |
The chunking algorithm first attempts to split text at natural sentence boundaries (periods, exclamation marks, question marks, newlines). If no suitable boundary is found within the target window, it falls back to character-based splitting. This approach is particularly important for SOP documents, where procedural steps often span multiple sentences and should not be split mid-instruction.
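The strategy above can be sketched in a few lines of Python. This is a simplified reconstruction under the stated parameters (500-character target, 100-character overlap, 30% boundary-validation threshold); the real implementation's function names and exact edge-case handling may differ.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into ~chunk_size-character chunks, preferring sentence boundaries."""
    boundaries = ".!?\n"
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Find the last sentence boundary inside the target window.
            window = text[start:end]
            cut = max(window.rfind(c) for c in boundaries)
            # Accept it only if the chunk stays reasonably full (>30% of
            # chunk_size); otherwise fall back to a character-based split.
            if cut > int(chunk_size * 0.3):
                end = start + cut + 1
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return [c for c in chunks if c]
```

Because `start` advances to `end - overlap`, the tail of each chunk reappears at the head of the next, so a procedural step cut at a boundary is still fully present in at least one chunk.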
Each text chunk is converted to a 384-dimensional vector using the intfloat/multilingual-e5-small embedding model from the sentence-transformers library. The E5-Small model was selected for three reasons: multilingual coverage of 100+ languages (important for Indonesian-language SOP documents), compact 384-dimensional vectors that keep index size and search latency low, and strong retrieval quality relative to its modest ~470 MB footprint.
Embeddings are stored in a ChromaDB persistent collection configured with cosine similarity as the distance metric. Each chunk is stored with associated metadata:
```python
collection.add(
    ids=[md5_hash(f"{filename}_chunk_{index}")],  # deterministic MD5-based ID
    embeddings=[embedding_vector],
    documents=[chunk_text],
    metadatas=[{"source": filename, "chunk_index": index}],
)
```
The use of MD5-based chunk IDs enables idempotent insertion—re-uploading the same document does not create duplicate entries. The PersistentClient stores data at ./chroma_db, ensuring that the vector index survives application restarts.
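The idempotency property follows directly from deterministic hashing: the same filename and chunk index always produce the same ID, so a duplicate-checking insert cannot create a second copy. A minimal illustration follows; `chunk_id` and `add_if_new` are hypothetical helpers, and a plain `dict` stands in for the ChromaDB collection.

```python
import hashlib

def chunk_id(filename: str, index: int) -> str:
    """Deterministic chunk ID: same file + same index -> same ID."""
    return hashlib.md5(f"{filename}_chunk_{index}".encode("utf-8")).hexdigest()

def add_if_new(store: dict, cid: str, chunk: str) -> bool:
    """Insert only when the ID is unseen; return True if inserted."""
    if cid in store:
        return False
    store[cid] = chunk
    return True
```

Re-running ingestion over an already-indexed document regenerates identical IDs, so every `add_if_new` call is a no-op and the index size is unchanged.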
At query time, the user's question is embedded using the same E5-Small model and searched against the ChromaDB collection using cosine similarity. The top-5 most similar chunks are retrieved along with their source metadata. These chunks are assembled into a structured context block and passed to DeepSeek-V3 with the following LLM configuration:
| Parameter | Value | Rationale |
|---|---|---|
| Model | deepseek-chat (V3) | Cost-effective with strong reasoning capabilities |
| Temperature | 0.3 | Low temperature for factual, deterministic SOP answers |
| Max Tokens | 1,500 | Sufficient for detailed procedural explanations |
| System Prompt | Context-grounded | Constrains answers to provided SOP context only |
The system prompt explicitly instructs the LLM to answer only based on the provided context, preventing hallucination of procedures not present in the indexed SOPs. Every generated answer includes source attribution identifying the originating document.
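The prompt-assembly step can be sketched as below. The function name and the system-prompt wording are illustrative only; the application's exact prompt is not reproduced in this paper.

```python
def build_messages(question, retrieved):
    """Assemble retrieved chunks into a context-grounded chat payload.

    `retrieved` is a list of (chunk_text, source_filename) pairs from the
    top-5 similarity search.
    """
    context = "\n\n".join(f"[Source: {src}]\n{text}" for text, src in retrieved)
    system = (
        "You are an assistant for company SOPs. Answer ONLY from the context "
        "below. If the answer is not in the context, say you do not know. "
        "Always name the source document.\n\nCONTEXT:\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list is what an OpenAI-compatible client would pass as `messages`, alongside the `deepseek-chat` model name, temperature 0.3, and the 1,500-token cap from the configuration table.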
Enterprise SOP systems handle sensitive operational procedures that require appropriate security measures. RAG SOP Assistant implements multiple security layers:
All user-supplied text is passed through html.escape() before any processing or display, preventing cross-site scripting (XSS) attacks.
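A minimal sketch of this sanitization step; the helper name and the length cap are illustrative assumptions, not the application's actual values.

```python
import html

def sanitize_input(raw: str, max_len: int = 2000) -> str:
    """Trim, cap, and HTML-escape user text before it is processed or shown."""
    text = raw.strip()[:max_len]
    return html.escape(text)  # <, >, &, and quotes become entities
```

Because escaping happens before the text reaches the retrieval pipeline or the chat display, markup injected into a question is rendered inert everywhere downstream.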
The DeepSeek API key is read from the DEEPSEEK_API_KEY environment variable (supplied via Hugging Face Secrets in cloud deployments, or exported locally). The key is never logged, included in error messages, or exposed through the web interface. Error sanitization ensures that API-related failures return generic error messages to the user.
The application uses a global threading.Lock to protect ChromaDB collection initialization, ensuring safe concurrent access in multi-user environments. The get_collection() function implements lazy initialization with thread-safe locking, preventing race conditions during the first query after startup.
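The pattern described here is classic double-checked locking. The sketch below reconstructs it under stated assumptions: `init_collection` is a stub standing in for the expensive ChromaDB setup, and the exact names may differ from the real `get_collection()`.

```python
import threading

def init_collection():
    # Stand-in for the real ChromaDB setup; must run exactly once.
    return {"name": "sop_chunks"}

_collection = None
_lock = threading.Lock()

def get_collection():
    """Thread-safe lazy initialization (double-checked locking)."""
    global _collection
    if _collection is None:            # fast path once initialized
        with _lock:
            if _collection is None:    # re-check while holding the lock
                _collection = init_collection()
    return _collection
```

The outer check keeps the common case lock-free, while the inner re-check guarantees that two threads racing on the first query cannot both run the expensive initialization.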
| Layer | Technology | Purpose |
|---|---|---|
| Web Framework | Gradio 5.9.1 | Interactive web UI with custom CSS theming |
| Language Model | DeepSeek-V3 (via OpenAI-compatible API) | Answer generation from retrieved SOP context |
| Embeddings | intfloat/multilingual-e5-small | 384-dimensional multilingual semantic vectors |
| Vector Database | ChromaDB (persistent mode) | Cosine similarity search over document embeddings |
| PDF Extraction | PyMuPDF (fitz) | Page-level text extraction from PDF documents |
| Word Extraction | python-docx | Paragraph-level text extraction from DOCX files |
| API Client | OpenAI SDK | DeepSeek API communication (OpenAI-compatible) |
| Validation | Pydantic 2.10.6 | Input validation and type checking |
| Language | Python 3.10+ | Application logic and pipeline orchestration |
The Gradio-based interface is organized into three functional tabs: Q&A chat, document upload, and database management.
The interface features custom CSS with the Plus Jakarta Sans font, a teal/blue gradient header, rounded corners, hover effects, and responsive layout for mobile and desktop usage.
The system includes five default SOP documents covering company procedures, staff training, technology usage, storage and maintenance, and emergency protocols. These documents are automatically indexed when the application starts, providing immediate Q&A capability without requiring manual upload. The auto-indexing process checks for existing entries to avoid duplicate indexing on subsequent restarts.
The recommended deployment path provides zero-configuration cloud hosting:
1. Duplicate the Hugging Face Space `romizone/RAG-SOP`
2. Add `DEEPSEEK_API_KEY` to the Space Secrets

Recommended hardware: CPU Basic (2 vCPU, 16GB RAM). The E5-Small embedding model (~470 MB) loads at startup and requires approximately 30–60 seconds for initialization.
```bash
git clone https://github.com/romizone/RAGSOP.git
cd RAGSOP
pip install -r requirements.txt
export DEEPSEEK_API_KEY="your-api-key-here"
python app.py
```
The application launches at http://localhost:7860 with full functionality including auto-indexing of default SOP documents.
| Metric | Value |
|---|---|
| Startup Time (including model loading) | ~30–60 seconds |
| Default SOP Documents | 5 PDFs (~256 chunks) |
| Embedding Model Size | ~470 MB |
| Embedding Dimensions | 384 |
| Query Response Time | ~3–5 seconds |
| Top-K Retrieval | 5 chunks per query |
| Recommended Hardware | 2 vCPU, 16GB RAM |
RAG SOP Assistant is designed for enterprise environments where rapid access to procedural knowledge is critical.
```
RAGSOP/
├── app.py                                  # Main Gradio application (~30 KB)
├── requirements.txt                        # Python dependencies
├── README.md                               # Comprehensive documentation
├── .gitignore                              # Git ignore rules
└── SOP/                                    # Default SOP documents
    ├── Kumpulan_SOP_Perusahaan.pdf         # Company SOP Collection
    ├── Pelatihan staf_8.pdf                # Staff Training Procedures
    ├── Penggunaan teknologi_7.pdf          # Technology Usage Guidelines
    ├── Penyimpanan dan pemeliharaan_4.pdf  # Storage & Maintenance
    └── SOP darurat_5.pdf                   # Emergency Procedures
```
RAG SOP Assistant demonstrates that Retrieval-Augmented Generation provides an effective approach to transforming static corporate SOP documents into dynamic, conversational knowledge bases. The combination of multilingual semantic embeddings, sentence-boundary-aware chunking, and context-grounded LLM generation enables accurate, source-attributed answers that employees can verify against original documents.
The system's enterprise-ready implementation—with thread-safe operations, XSS prevention, input validation, and automatic document indexing—addresses the practical requirements of organizational deployment. The three-tab Gradio interface provides a complete workflow from document upload through database management to conversational Q&A, all accessible through a web browser without local installation.
Future directions include:
The complete source code is available at https://github.com/romizone/RAGSOP and a live demo is accessible at https://huggingface.co/spaces/romizone/RAG-SOP.