RAG: Your private detective and teacher in one
- Aju John
- Apr 29
- 6 min read
Updated: Jun 10

From Clues to Classrooms: Build a document-solving RAG pipeline that transforms unstructured chaos into structured insights using retrieval-augmented AI, entirely offline so your data never goes public.
What if you could unleash the transformative power of artificial intelligence on your personal documents—instantly, securely, and entirely offline? Picture AI that not only understands your files but also retrieves and generates answers with precision, thanks to cutting-edge Retrieval-Augmented Generation (RAG). No internet. No data leaving your device. Just you, your documents, and the next level of private, intelligent assistance—right at your fingertips. Combined with a local LLM and a nifty Python program (script), you are limited only by the hardware resources you can throw at it.
This computer program I wrote acts like a study buddy for your documents. It can read a document (such as a textbook or instruction manual), break it into smaller pieces, and remember the important parts. When you ask it a question, it quickly finds the most relevant sections and uses a smart chatbot (similar to ChatGPT, but running on your own computer) to give you clear answers based on the document's content. It's like having a detective that searches through your files and a teacher that explains what they mean!

The program first uses a PDF reader to extract text, then chops it into smaller sections so the computer can process them more efficiently. It converts these text chunks into numerical codes (called "embeddings") that capture their meaning, storing them in a special database optimized for quick searches. When you ask a question, the system finds the most relevant text sections by comparing numerical patterns, then feeds these to an AI language model (Llama3) that is prompted to generate clear answers while staying grounded in the original document's content. The entire process combines database management, machine learning, and natural language processing to mimic how humans might research and synthesize information.
This post describes my script, which implements a RAG (Retrieval-Augmented Generation) system that processes PDF documents for question answering through multiple integrated technologies.
It uses:
PyPDFLoader from LangChain to load PDFs,
RecursiveCharacterTextSplitter to chunk text,
HuggingFaceEmbeddings (via the updated langchain-huggingface package) for generating GPU-accelerated embeddings on an NVIDIA RTX 2080.
The embeddings and documents are stored in Milvus, a vector database running in a Docker container, optimized with GPU indexing (GPU_IVF_FLAT).
Queries leverage Ollama-hosted Llama3 (a large language model) to generate answers using retrieved context.
The workflow combines LangChain's document processing, Milvus's vector search, and Ollama's local LLM inference for an end-to-end, GPU-enhanced RAG pipeline.
This Python script ties the pieces above into an end-to-end pipeline optimized for PDF documents. Below is the technical analysis:

Assuming a stand-alone Milvus vector DB is running in a Docker container and you have deployed Ollama running Llama3, here are the steps of the RAG pipeline:
1. Document Processing
· Uses PyPDFLoader from LangChain to extract text from PDF files
· Implements RecursiveCharacterTextSplitter with 1000-character chunks and 200-character overlap for context preservation
2. Embedding Generation
· Employs sentence-transformers/all-mpnet-base-v2 model via HuggingFaceEmbeddings
· Runs on the CPU in the current configuration, with GPU acceleration available via CUDA
3. Vector Storage
· Utilizes Milvus vector database with Docker deployment
· Configures GPU_IVF_FLAT index for optimized similarity search
· Implements automatic collection management (drop_old=True)
4. Language Model Integration
· Leverages Ollama-hosted Llama3 model for response generation
· Uses temperature 0.3 for balanced creativity/accuracy
Key Technologies
Technology | Role | Implementation Details |
LangChain | Document processing pipeline | Chunking and PDF extraction |
Milvus | Vector database | GPU-optimized indexing |
Hugging Face | Text embeddings | Sentence Transformers model |
Ollama | LLM inference | Local Llama3 deployment |
pypdf | PDF text extraction | Underlying PDF parser |
Dependencies
Core requirements:
langchain-community
pymilvus
ollama
pypdf
sentence-transformers
Coding Details
The program needs these imports in order to run:
import ollama
from pymilvus import connections, utility
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus
Configuration
COLLECTION_NAME = "pdf_collection"
PDF_PATH = "C:/Users/user/Documents/P10B.pdf"  # Replace with the path to your document
1. Initialize Embeddings (correct dimension handling)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cpu'}  # use 'cuda' if available
)
2. Document Processing
loader = PyPDFLoader(PDF_PATH)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(loader.load())
3. Milvus Setup with LangChain Integration
vector_store = Milvus.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": "localhost", "port": "19530"},
    drop_old=True
)
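The call above builds the collection with Milvus's default index. To get the GPU_IVF_FLAT index described earlier, index parameters can be passed through from_documents; the sketch below assumes a GPU-enabled Milvus image, and the nlist value is a tunable placeholder rather than a recommendation.
vector_store = Milvus.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": "localhost", "port": "19530"},
    index_params={
        "index_type": "GPU_IVF_FLAT",  # GPU-resident IVF index
        "metric_type": "L2",           # distance metric for similarity search
        "params": {"nlist": 1024}      # number of IVF clusters (tune for your data)
    },
    drop_old=True
)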
4. Verify Collection Creation
connections.connect(host="localhost", port="19530")
print("Collections in DB:", utility.list_collections())
5. RAG Query Function
def rag_query(question: str, top_k: int = 3):
    # Get vector store reference
    vector_store = Milvus(
        embedding_function=embeddings,
        collection_name=COLLECTION_NAME,
        connection_args={"host": "localhost", "port": "19530"}
    )
The user's question is then run against the vector database using similarity search; the retrieved chunks become the context that augments the prompt sent to the Llama3 LLM.
    # Semantic Search
    docs = vector_store.similarity_search(query=question, k=top_k)
    context = "\n\n".join(doc.page_content for doc in docs)
    response = ollama.generate(
        model='llama3',
        prompt=f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        options={'temperature': 0.3}
    )
    return response['response']
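A minimal way to exercise the function is a console loop; the Question/Processing/Answer format below mirrors the demo dialog at the end of this post.
if __name__ == "__main__":
    # Simple interactive loop around rag_query (illustrative only)
    while True:
        question = input("Question: ").strip()
        if not question:
            break
        print("Processing...")
        print("Answer:", rag_query(question))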
Use Cases
1. Document Q&A Systems
o Technical manuals analysis
o Legal document review
o Academic research assistance
2. Enterprise Applications
o Internal knowledge base search
o Customer support automation
o Compliance document analysis
3. Specialized Implementations
o Genealogical research (as used for family tree analysis)
o Medical record processing
o Financial report analysis
Performance Considerations
· GPU Utilization: The current configuration uses the CPU for embeddings but supports NVIDIA GPUs via CUDA (see the sketch after this list)
· Milvus Optimization: GPU_IVF_FLAT index balances speed and accuracy for medium-sized datasets
· Chunking Strategy: 1000-character chunks with a 200-character overlap maintain contextual integrity
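For the GPU path, one option is to select the embedding device at runtime; the snippet below is a sketch that assumes a CUDA-enabled PyTorch build is installed.
import torch

# Pick the embedding device automatically (falls back to CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': device}
)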
Other Enhancements
I enhanced the RAG pipeline by integrating a WebUI using the Streamlit framework, making it much easier for users to interact with the system and visualize retrieval-augmented generation results in real time.
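The Streamlit layer itself is not reproduced here, but a minimal sketch of the idea, reusing the rag_query function above, could look like the following; the module name rag_pipeline and the widget labels are placeholders, not the actual implementation.
# streamlit_app.py -- minimal chat front end for the RAG pipeline (sketch)
import streamlit as st
from rag_pipeline import rag_query  # assumes the code above is saved as rag_pipeline.py

st.title("Private RAG Chat")
question = st.text_input("Ask a question about your document")
if st.button("Ask") and question:
    with st.spinner("Processing..."):
        st.write(rag_query(question))
Launch it with streamlit run streamlit_app.py while Milvus and Ollama are running.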

To further improve efficiency, I implemented idempotent vector embeddings so that if the same document is processed multiple times, its embedding is generated only once and reused, reducing redundant computation and storage needs. Additionally, I added support for managing multiple collections of documents within the vector database, allowing users to easily switch between collections. This capability ensures that context and retrieval remain grounded in a specific area or domain, enabling more focused and relevant responses for different use cases. Collectively, these enhancements make the RAG pipeline more user-friendly, efficient, and adaptable to a variety of document management scenarios.
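One way to approximate that idempotent behaviour (a sketch, not the exact implementation used here) is to derive the collection name from a hash of the PDF and re-embed only when that collection does not already exist; the snippet reuses the chunks, embeddings, and PDF_PATH objects defined earlier.
import hashlib
from pymilvus import connections, utility

def collection_for_pdf(pdf_path: str) -> str:
    # Derive a stable collection name from the file's contents
    with open(pdf_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    return f"pdf_{digest}"

connections.connect(host="localhost", port="19530")
name = collection_for_pdf(PDF_PATH)
if not utility.has_collection(name):
    # Embed and store only once per unique document; reuse thereafter
    Milvus.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=name,
        connection_args={"host": "localhost", "port": "19530"}
    )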
Here’s a useful technique: You can seamlessly inject custom knowledge into your RAG pipeline by simply preparing a new document—such as a PDF—containing your own facts or domain-specific information. After saving your content, integrate the document into your vector database indexing process. This allows the retrieval component of your RAG system to access and incorporate these facts during search operations.
By doing so, you can:
• Expand your document corpus: Augment your existing collection with additional, curated data points or domain knowledge.
• Test retrieval and augmentation accuracy: Verify that your pipeline correctly retrieves and utilizes the new information by querying the system with relevant prompts.
• Enhance context generation: Ensure your language model (LLM) receives the most pertinent contextual data for improved response quality.
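In code, folding such a fact sheet into the pipeline can be as simple as loading it and appending its chunks to the existing collection instead of rebuilding it; the file path below is a placeholder, and the sketch reuses the text_splitter and embeddings objects defined earlier.
# Append a supplemental PDF to the existing collection (sketch; no drop_old here)
extra_loader = PyPDFLoader("C:/Users/user/Documents/my_facts.pdf")  # placeholder path
extra_chunks = text_splitter.split_documents(extra_loader.load())

vector_store = Milvus(
    embedding_function=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": "localhost", "port": "19530"}
)
vector_store.add_documents(extra_chunks)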
Environment Setup needed for RAG Pipeline
To set up a Retrieval-Augmented Generation (RAG) pipeline on Windows using Milvus as the vector database, Ollama running Llama 3 as the language model, and Python for orchestration:
1. Install Python 3.11 and Visual Studio Code, then create a project directory for your scripts and data.
2. Install Docker Desktop, pull the Milvus Docker image, and launch Milvus as a standalone service; install the PyMilvus SDK to test connectivity.
3. Download and install Ollama, pull the Llama 3 model, and start Ollama as a background service.
4. Install the necessary Python dependencies (including langchain, pymilvus, ollama, and streamlit), and optionally set up GPU support for PyTorch if using hardware acceleration.
5. Prepare your documents, generate embeddings with the script to populate Milvus, and launch the chatbot web interface using Streamlit for interactive Q&A.
Ensure both Ollama and Milvus are running in the background, and keep your drivers and libraries updated for compatibility as software versions evolve.
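Once everything is installed, a quick sanity check from Python confirms both services are reachable before indexing anything; the port and model name below are the defaults assumed throughout this post.
import ollama
from pymilvus import connections, utility

# Milvus should answer on its default port
connections.connect(host="localhost", port="19530")
print("Milvus collections:", utility.list_collections())

# Ollama should return a short completion from the local Llama3 model
print(ollama.generate(model='llama3', prompt="Say hello in one word.")['response'])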

Development Context
Created by Aju John for processing the PULICKAPARAMB (P10B) Family Tree PDF, this implementation demonstrates a production-ready RAG pipeline that could scale to handle thousands of documents with appropriate hardware resources. The modular design allows easy substitution of components (e.g., swapping Milvus for ChromaDB) while maintaining core functionality.
Demo
Here is a dialog with the completed RAG pipeline for the above context:
Question: What is the document about?
Processing...
Answer: The document appears to be a family tree or genealogy of many families, detailing the birth dates, occupations, marriages, and children of various individuals. It seems to be a comprehensive record of the family's history and relationships.
Question: what are the names of different places mentioned in the document?
Processing...
Answer: The names of different places mentioned in the document are:
1. Parambil house
2. Aymanam
3. Areeparambu
4. Olesha
5. Thiruvalla
6. Alwaye
7. Quilon
8. Trivandrum
9. Kottayam
10. Kuwait
11. Nelliampathy
12. Manalaeparambil
13. Thazhathangadi
Other questions can be:
Question: How many generations does the genealogy in this document span?
Question: Who is the oldest living member in the document?
Question: Who is the youngest member in the document?
Interested in building a RAG pipeline for your confidential data? Get in touch!
Disclaimer: This setup worked as of May 27, 2025; it is not guaranteed to work verbatim in the future as versions change and is meant to serve as a guideline based on how it worked on that day. Interdependencies are version dependent. The system essentially "guesses" answers probabilistically but grounds those guesses in retrieved facts, so while the answers may be factual, they may not be complete or accurate. Use it at your own risk.




