RAG: Your private detective and teacher in one
- Aju John
- Apr 29
- 6 min read
Updated: Jun 10

From Clues to Classrooms: Build a document-solving RAG pipeline that transforms unstructured chaos into structured insights using retrieval-augmented AI, entirely offline so your data never goes public.
What if you could unleash the transformative power of artificial intelligence on your personal documents—instantly, securely, and entirely offline? Picture AI that not only understands your files but also retrieves and generates answers with precision, thanks to cutting-edge Retrieval-Augmented Generation (RAG). No internet. No data leaving your device. Just you, your documents, and the next level of private, intelligent assistance—right at your fingertips. Combined with a local LLM and a nifty Python program (script), you are limited only by the hardware resources you can throw at it.
This computer program I wrote acts like a study buddy for your documents. It can read a document (such as a textbook or instruction manual), break it into smaller pieces, and remember the important parts. When you ask it a question, it quickly finds the most relevant sections and uses a smart chatbot (similar to ChatGPT, but running on your own computer) to give you clear answers based on the document's content. It's like having a detective that searches through your files and a teacher that explains what they mean!

The program first uses a PDF reader to extract text, then chops it into smaller sections so the computer can process them more efficiently. It converts these text chunks into numerical codes (called "embeddings") that capture their meaning, storing them in a special database optimized for quick searches. When you ask a question, the system finds the most relevant text sections by comparing numerical patterns, then feeds these to an AI language model (Llama3) that is prompted to generate clear answers while staying grounded in the original document's content. The entire process combines database management, machine learning, and natural language processing to mimic how humans might research and synthesize information.
This post describes my script, which implements a RAG (Retrieval-Augmented Generation) system that processes PDF documents for question answering through multiple integrated technologies.
It uses:
PyPDFLoader from LangChain to load PDFs,
RecursiveCharacterTextSplitter to chunk text,
HuggingFaceEmbeddings (via the updated langchain-huggingface package) for generating GPU-accelerated embeddings on an NVIDIA RTX 2080.
The embeddings and documents are stored in Milvus, a vector database running in a Docker container, optimized with GPU indexing (GPU_IVF_FLAT).
Queries leverage Ollama-hosted Llama3 (a large language model) to generate answers using retrieved context.
The workflow combines LangChain's document processing, Milvus's vector search, and Ollama's local LLM inference for an end-to-end, GPU-enhanced RAG pipeline.
This Python script ties the pieces above into an end-to-end pipeline optimized for PDF documents. Below is the technical analysis:

Assuming a stand-alone Milvus vector DB is running in a Docker container and you have deployed Ollama running Llama3, here are the steps of the RAG pipeline:
1. Document Processing
· Uses PyPDFLoader from LangChain to extract text from PDF files
· Implements RecursiveCharacterTextSplitter with 1000-character chunks and 200-character overlap for context preservation
2. Embedding Generation
· Employs sentence-transformers/all-mpnet-base-v2 model via HuggingFaceEmbeddings
· Runs on the CPU in the current configuration, with GPU acceleration available via CUDA
3. Vector Storage
· Utilizes Milvus vector database with Docker deployment
· Configures GPU_IVF_FLAT index for optimized similarity search
· Implements automatic collection management (drop_old=True)
4. Language Model Integration
· Leverages Ollama-hosted Llama3 model for response generation
· Uses temperature 0.3 for balanced creativity/accuracy
Key Technologies
Technology | Role | Implementation Details |
LangChain | Document processing pipeline | Chunking and PDF extraction |
Milvus | Vector database | GPU-optimized indexing |
Hugging Face | Text embeddings | Sentence Transformers model |
Ollama | LLM inference | Local Llama3 deployment |
pypdf | PDF text extraction | Underlying PDF parser |
Dependencies
Core requirements:
langchain-community
pymilvus
ollama
pypdf
sentence-transformers
Coding Details
The program needs these imports in order to run:
import ollama
from pymilvus import connections, utility
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus
Configuration
COLLECTION_NAME = "pdf_collection"
PDF_PATH = "C:/Users/user/Documents/P10B.pdf"  # Replace with the path to your document
1. Initialize Embeddings (correct dimension handling)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cpu'}  # use 'cuda' if available
)
2. Document Processing
loader = PyPDFLoader(PDF_PATH)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(loader.load())
3. Milvus Setup with LangChain Integration
vector_store = Milvus.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": "localhost", "port": "19530"},
    drop_old=True
)
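The call above builds the collection with Milvus's default index. To get the GPU_IVF_FLAT index described earlier, index parameters can be passed through from_documents; the sketch below assumes a GPU-enabled Milvus image, and the nlist value is a tunable placeholder rather than a recommendation.
vector_store = Milvus.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": "localhost", "port": "19530"},
    index_params={
        "index_type": "GPU_IVF_FLAT",  # GPU-resident IVF index
        "metric_type": "L2",           # distance metric for similarity search
        "params": {"nlist": 1024}      # number of IVF clusters (tune for your data)
    },
    drop_old=True
)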
4. Verify Collection Creation
connections.connect(host="localhost", port="19530")
print("Collections in DB:", utility.list_collections())
5. RAG Query Function
def rag_query(question: str, top_k: int = 3):
    # Get vector store reference
    vector_store = Milvus(
        embedding_function=embeddings,
        collection_name=COLLECTION_NAME,
        connection_args={"host": "localhost", "port": "19530"}
    )
The user's question is then run against the vector database using similarity search; the retrieved chunks become the context that augments the prompt sent to the Llama3 LLM.
    # Semantic Search
    docs = vector_store.similarity_search(query=question, k=top_k)
    context = "\n\n".join(doc.page_content for doc in docs)
    response = ollama.generate(
        model='llama3',
        prompt=f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        options={'temperature': 0.3}
    )
    return response['response']
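A minimal way to exercise the function is a console loop; the Question/Processing/Answer format below mirrors the demo dialog at the end of this post.
if __name__ == "__main__":
    # Simple interactive loop around rag_query (illustrative only)
    while True:
        question = input("Question: ").strip()
        if not question:
            break
        print("Processing...")
        print("Answer:", rag_query(question))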
Use Cases
1. Document Q&A Systems
o Technical manuals analysis
o Legal document review
o Academic research assistance
2. Enterprise Applications
o Internal knowledge base search
o Customer support automation
o Compliance document analysis
3. Specialized Implementations
o Genealogical research (as used for family tree analysis)
o Medical record processing
o Financial report analysis
Performance Considerations
· GPU Utilization: The current configuration uses the CPU for embeddings but supports NVIDIA GPUs via CUDA (see the sketch after this list)
· Milvus Optimization: GPU_IVF_FLAT index balances speed and accuracy for medium-sized datasets
· Chunking Strategy: 1000-character chunks with a 200-character overlap maintain contextual integrity
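For the GPU path, one option is to select the embedding device at runtime; the snippet below is a sketch that assumes a CUDA-enabled PyTorch build is installed.
import torch

# Pick the embedding device automatically (falls back to CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': device}
)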
Other Enhancements
I enhanced the RAG pipeline by integrating a WebUI using the Streamlit framework, making it much easier for users to interact with the system and visualize retrieval-augmented generation results in real time.
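The Streamlit layer itself is not reproduced here, but a minimal sketch of the idea, reusing the rag_query function above, could look like the following; the module name rag_pipeline and the widget labels are placeholders, not the actual implementation.
# streamlit_app.py -- minimal chat front end for the RAG pipeline (sketch)
import streamlit as st
from rag_pipeline import rag_query  # assumes the code above is saved as rag_pipeline.py

st.title("Private RAG Chat")
question = st.text_input("Ask a question about your document")
if st.button("Ask") and question:
    with st.spinner("Processing..."):
        st.write(rag_query(question))
Launch it with streamlit run streamlit_app.py while Milvus and Ollama are running.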

To further improve efficiency, I implemented idempotent vector embeddings so that if the same document is processed multiple times, its embedding is generated only once and reused, reducing redundant computation and storage needs. Additionally, I added support for managing multiple collections of documents within the vector database, allowing users to easily switch between collections. This capability ensures that context and retrieval remain grounded in a specific area or domain, enabling more focused and relevant responses for different use cases. Collectively, these enhancements make the RAG pipeline more user-friendly, efficient, and adaptable to a variety of document management scenarios.
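One way to approximate that idempotent behaviour (a sketch, not the exact implementation used here) is to derive the collection name from a hash of the PDF and re-embed only when that collection does not already exist; the snippet reuses the chunks, embeddings, and PDF_PATH objects defined earlier.
import hashlib
from pymilvus import connections, utility

def collection_for_pdf(pdf_path: str) -> str:
    # Derive a stable collection name from the file's contents
    with open(pdf_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    return f"pdf_{digest}"

connections.connect(host="localhost", port="19530")
name = collection_for_pdf(PDF_PATH)
if not utility.has_collection(name):
    # Embed and store only once per unique document; reuse thereafter
    Milvus.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=name,
        connection_args={"host": "localhost", "port": "19530"}
    )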
Here’s a useful technique: You can seamlessly inject custom knowledge into your RAG pipeline by simply preparing a new document—such as a PDF—containing your own facts or domain-specific information. After saving your content, integrate the document into your vector database indexing process. This allows the retrieval component of your RAG system to access and incorporate these facts during search operations.
By doing so, you can:
• Expand your document corpus: Augment your existing collection with additional, curated data points or domain knowledge.
• Test retrieval and augmentation accuracy: Verify that your pipeline correctly retrieves and utilizes the new information by querying the system with relevant prompts.
• Enhance context generation: Ensure your language model (LLM) receives the most pertinent contextual data for improved response quality.
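In code, folding such a fact sheet into the pipeline can be as simple as loading it and appending its chunks to the existing collection instead of rebuilding it; the file path below is a placeholder, and the sketch reuses the text_splitter and embeddings objects defined earlier.
# Append a supplemental PDF to the existing collection (sketch; no drop_old here)
extra_loader = PyPDFLoader("C:/Users/user/Documents/my_facts.pdf")  # placeholder path
extra_chunks = text_splitter.split_documents(extra_loader.load())

vector_store = Milvus(
    embedding_function=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": "localhost", "port": "19530"}
)
vector_store.add_documents(extra_chunks)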
Environment Setup needed for RAG Pipeline
To set up a Retrieval-Augmented Generation (RAG) pipeline on Windows using Milvus as the vector database, Ollama running Llama 3 as the language model, and Python for orchestration:
1. Install Python 3.11 and Visual Studio Code, then create a project directory for your scripts and data.
2. Install Docker Desktop, pull the Milvus Docker image, and launch Milvus as a standalone service; install the PyMilvus SDK to test connectivity.
3. Download and install Ollama, pull the Llama 3 model, and start Ollama as a background service.
4. Install the necessary Python dependencies (including langchain, pymilvus, ollama, and streamlit), and optionally set up GPU support for PyTorch if using hardware acceleration.
5. Prepare your documents, generate embeddings with the script to populate Milvus, and launch the chatbot web interface using Streamlit for interactive Q&A.
Ensure both Ollama and Milvus are running in the background, and keep your drivers and libraries updated for compatibility as software versions evolve.
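Once everything is installed, a quick sanity check from Python confirms both services are reachable before indexing anything; the port and model name below are the defaults assumed throughout this post.
import ollama
from pymilvus import connections, utility

# Milvus should answer on its default port
connections.connect(host="localhost", port="19530")
print("Milvus collections:", utility.list_collections())

# Ollama should return a short completion from the local Llama3 model
print(ollama.generate(model='llama3', prompt="Say hello in one word.")['response'])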

Development Context
Created by Aju John for processing the PULICKAPARAMB (P10B) Family Tree PDF, this implementation demonstrates a production-ready RAG pipeline that could scale to handle thousands of documents with appropriate hardware resources. The modular design allows easy substitution of components (e.g., swapping Milvus for ChromaDB) while maintaining core functionality.
Demo
Here is a dialog with the completed RAG pipeline for the above context:
Question: What is the document about?
Processing...
Answer: The document appears to be a family tree or genealogy of many families, detailing the birth dates, occupations, marriages, and children of various individuals. It seems to be a comprehensive record of the family's history and relationships.
Question: what are the names of different places mentioned in the document?
Processing...
Answer: The names of different places mentioned in the document are:
1. Parambil house
2. Aymanam
3. Areeparambu
4. Olesha
5. Thiruvalla
6. Alwaye
7. Quilon
8. Trivandrum
9. Kottayam
10. Kuwait
11. Nelliampathy
12. Manalaeparambil
13. Thazhathangadi
Other questions can be:
Question: How many generations does the genealogy in this document span?
Question: Who is the oldest living member in the document?
Question: Who is the youngest member in the document?
Interested in building a RAG pipeline for your confidential data? Get in touch!
Disclaimer: This setup worked as of May 27, 2025; it is not guaranteed to work verbatim in the future as versions change and is meant to serve as a guideline based on how it worked on that day. Interdependencies are version dependent. The system essentially "guesses" answers probabilistically but grounds those guesses in retrieved facts, so while the answers may be factual, they may not be complete or accurate. Use it at your own risk.




