📚

RAG with PDF Documents

๐Ÿ‘จโ€๐Ÿณ๐Ÿ‘‘ Master Chefโฑ๏ธ 45 minutes

📋 Suggested prerequisites

  • Python
  • LLM API (Gemini)

What you'll build

A RAG (Retrieval-Augmented Generation) system that answers questions about your own PDF documents.

You load your PDFs (contracts, manuals, documentation), and you can ask things like "What's the return policy?" or "What does article 5 say?". The system searches your documents, finds the relevant information, and generates a precise answer using AI.

When finished, you'll have a Python script that processes PDFs, creates embeddings with Gemini, stores them in ChromaDB, and answers questions based on the actual content of your documents.


The prompt to start

Create a RAG system in Python that:

  1. Loads PDFs from a folder
  2. Splits them into 500-character chunks
  3. Creates embeddings with Gemini
  4. Stores in ChromaDB
  5. Lets you ask questions and retrieves the relevant context
  6. Uses Gemini to answer based on context

What the AI will create

from PyPDF2 import PdfReader
import chromadb
from chromadb.utils import embedding_functions
import google.generativeai as genai
import os

# Configure
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# ChromaDB with Gemini embeddings
client = chromadb.Client()
gemini_ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key=os.environ["GEMINI_API_KEY"]
)
# get_or_create avoids an error if the collection already exists on a re-run
collection = client.get_or_create_collection("docs", embedding_function=gemini_ef)

def load_pdfs(folder: str) -> list[str]:
    """Load all PDFs from a folder"""
    chunks = []
    for file in os.listdir(folder):
        if file.endswith(".pdf"):
            reader = PdfReader(os.path.join(folder, file))
            # extract_text() can return None for image-only pages
            text = "".join(page.extract_text() or "" for page in reader.pages)
            # Split into chunks
            for i in range(0, len(text), 500):
                chunks.append(text[i:i+500])
    return chunks

def index_documents(folder: str):
    """Index documents in ChromaDB"""
    chunks = load_pdfs(folder)
    collection.add(
        documents=chunks,
        ids=[f"chunk_{i}" for i in range(len(chunks))]
    )
    print(f"Indexed {len(chunks)} chunks")

def ask(question: str) -> str:
    """Ask the RAG"""
    # Search relevant context
    results = collection.query(
        query_texts=[question],
        n_results=3
    )
    context = "\n---\n".join(results["documents"][0])

    # Generate response with context
    prompt = f"""Answer using ONLY information from the context.
If not in context, say "I don't have that information".

CONTEXT:
{context}

QUESTION: {question}"""

    response = model.generate_content(prompt)
    return response.text

# Usage
index_documents("./my_documents")
answer = ask("What is the onboarding process?")
print(answer)
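The script above returns the full answer in one go. The google-generativeai SDK's `generate_content` also accepts `stream=True`, which yields the response in chunks as it is generated. A minimal sketch of a helper that prints the answer as it arrives (the `stream_answer` name is just a suggestion, not part of the SDK):

```python
def stream_answer(model, prompt: str) -> str:
    """Print the answer as it streams in and return the full text.

    Assumes `model` is a genai.GenerativeModel; stream=True is part of
    the google-generativeai API.
    """
    parts = []
    for chunk in model.generate_content(prompt, stream=True):
        print(chunk.text, end="", flush=True)  # show each piece as it arrives
        parts.append(chunk.text)
    print()
    return "".join(parts)
```

You could call it in place of `model.generate_content(prompt)` inside `ask` if you want the answer to appear in real time.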

Installation

pip install PyPDF2 chromadb google-generativeai

Suggested improvements

| Improvement     | Description         |
|-----------------|---------------------|
| Semantic chunks | Split by paragraphs |
| Metadata        | Store filename      |
| Reranking       | Reorder results     |
| Streaming       | Real-time response  |
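The first two improvements combine naturally: split on blank lines instead of fixed 500-character windows, so chunks don't cut sentences in half, and record which file each chunk came from. A minimal sketch (the `split_paragraphs` helper and its `max_chars` parameter are illustrative, not from any library):

```python
def split_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Split text on blank lines, packing whole paragraphs into chunks.

    Paragraphs are never cut; a single paragraph longer than max_chars
    is kept as one chunk.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # +2 accounts for the "\n\n" separator between paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# ChromaDB's add() accepts a parallel `metadatas` list, so you could
# store the source filename alongside each chunk:
# collection.add(
#     documents=chunks,
#     metadatas=[{"source": filename} for _ in chunks],
#     ids=[f"{filename}_{i}" for i in range(len(chunks))],
# )
```

With metadata in place, `collection.query` results include a `metadatas` field you can use to cite which PDF an answer came from.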

Next level

→ Vector Search