📚

RAG with PDF Documents

๐Ÿ‘จโ€๐Ÿณ๐Ÿ‘‘ Master Chefโฑ๏ธ 45 minutes

📋 Suggested prerequisites

  • Python
  • LLM API (Gemini)

What you'll build

A RAG (Retrieval-Augmented Generation) system that answers questions about your own PDF documents.

You load your PDFs (contracts, manuals, documentation), and you can ask things like "What's the return policy?" or "What does article 5 say?". The system searches your documents, finds the relevant information, and generates a precise answer using AI.

When finished, you'll have a Python script that processes PDFs, creates embeddings with Gemini, stores them in ChromaDB, and answers questions based on the actual content of your documents.


The prompt to start

Create a RAG system in Python that:

  1. Loads PDFs from a folder
  2. Splits them into 500-character chunks
  3. Creates embeddings with Gemini
  4. Stores in ChromaDB
  5. Lets you ask questions and retrieves the relevant context
  6. Uses Gemini to answer based on context

What the AI will create

from PyPDF2 import PdfReader
import chromadb
from chromadb.utils import embedding_functions
import google.generativeai as genai
import os

# Configure
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# ChromaDB with Gemini embeddings
client = chromadb.Client()
gemini_ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key=os.environ["GEMINI_API_KEY"]
)
# get_or_create avoids an error if the collection already exists on a re-run
collection = client.get_or_create_collection("docs", embedding_function=gemini_ef)

def load_pdfs(folder: str) -> list[str]:
    """Load all PDFs from a folder"""
    chunks = []
    for file in os.listdir(folder):
        if file.endswith(".pdf"):
            reader = PdfReader(os.path.join(folder, file))
            # extract_text() can return None for image-only pages
            text = "".join(page.extract_text() or "" for page in reader.pages)
            # Split into chunks
            for i in range(0, len(text), 500):
                chunks.append(text[i:i+500])
    return chunks

def index_documents(folder: str):
    """Index documents in ChromaDB"""
    chunks = load_pdfs(folder)
    collection.add(
        documents=chunks,
        ids=[f"chunk_{i}" for i in range(len(chunks))]
    )
    print(f"Indexed {len(chunks)} chunks")

def ask(question: str) -> str:
    """Ask the RAG"""
    # Search relevant context
    results = collection.query(
        query_texts=[question],
        n_results=3
    )
    context = "\n---\n".join(results["documents"][0])

    # Generate response with context
    prompt = f"""Answer using ONLY information from the context.
If not in context, say "I don't have that information".

CONTEXT:
{context}

QUESTION: {question}"""

    response = model.generate_content(prompt)
    return response.text

# Usage
index_documents("./my_documents")
answer = ask("What is the onboarding process?")
print(answer)
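The script above returns the full answer in one go. The google-generativeai SDK's `generate_content` also accepts `stream=True`, which yields the response in chunks as it is generated. A minimal sketch of a helper that prints the answer as it arrives (the `stream_answer` name is just a suggestion, not part of the SDK):

```python
def stream_answer(model, prompt: str) -> str:
    """Print the answer as it streams in and return the full text.

    Assumes `model` is a genai.GenerativeModel; stream=True is part of
    the google-generativeai API.
    """
    parts = []
    for chunk in model.generate_content(prompt, stream=True):
        print(chunk.text, end="", flush=True)  # show each piece as it arrives
        parts.append(chunk.text)
    print()
    return "".join(parts)
```

You could call it in place of `model.generate_content(prompt)` inside `ask` if you want the answer to appear in real time.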

Installation

pip install PyPDF2 chromadb google-generativeai

Suggested improvements

| Improvement     | Description         |
|-----------------|---------------------|
| Semantic chunks | Split by paragraphs |
| Metadata        | Store filename      |
| Reranking       | Reorder results     |
| Streaming       | Real-time response  |
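The first two improvements combine naturally: split on blank lines instead of fixed 500-character windows, so chunks don't cut sentences in half, and record which file each chunk came from. A minimal sketch (the `split_paragraphs` helper and its `max_chars` parameter are illustrative, not from any library):

```python
def split_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Split text on blank lines, packing whole paragraphs into chunks.

    Paragraphs are never cut; a single paragraph longer than max_chars
    is kept as one chunk.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # +2 accounts for the "\n\n" separator between paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# ChromaDB's add() accepts a parallel `metadatas` list, so you could
# store the source filename alongside each chunk:
# collection.add(
#     documents=chunks,
#     metadatas=[{"source": filename} for _ in chunks],
#     ids=[f"{filename}_{i}" for i in range(len(chunks))],
# )
```

With metadata in place, `collection.query` results include a `metadatas` field you can use to cite which PDF an answer came from.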

Next level

→ Vector Search