What you'll build
A RAG (Retrieval-Augmented Generation) system that answers questions about your own PDF documents.
You load your PDFs (contracts, manuals, documentation), and you can ask things like "What's the return policy?" or "What does article 5 say?". The system searches your documents, finds the relevant information, and generates a precise answer using AI.
When finished, you'll have a Python script that processes PDFs, creates embeddings with Gemini, stores them in ChromaDB, and answers questions based on the actual content of your documents.
The prompt to start
Create a RAG system in Python that:
- Loads PDFs from a folder
- Splits them into 500-character chunks
- Creates embeddings with Gemini
- Stores in ChromaDB
- Accepts questions and retrieves the relevant context
- Uses Gemini to answer based on context
What the AI will create
from PyPDF2 import PdfReader
import chromadb
from chromadb.utils import embedding_functions
import google.generativeai as genai
import os
# Configure
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
# ChromaDB with Gemini embeddings
client = chromadb.Client()
gemini_ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key=os.environ["GEMINI_API_KEY"]
)
# get_or_create_collection makes the script safe to re-run
collection = client.get_or_create_collection("docs", embedding_function=gemini_ef)
def load_pdfs(folder: str) -> list[str]:
    """Load all PDFs from a folder and split them into 500-character chunks"""
    chunks = []
    for file in os.listdir(folder):
        if file.endswith(".pdf"):
            reader = PdfReader(os.path.join(folder, file))
            # extract_text() can return None for image-only pages
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
            # Split into fixed-size chunks
            for i in range(0, len(text), 500):
                chunks.append(text[i:i+500])
    return chunks
def index_documents(folder: str):
    """Index document chunks in ChromaDB"""
    chunks = load_pdfs(folder)
    collection.add(
        documents=chunks,
        ids=[f"chunk_{i}" for i in range(len(chunks))]
    )
    print(f"Indexed {len(chunks)} chunks")
def ask(question: str) -> str:
    """Ask the RAG a question"""
    # Retrieve the most relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=3
    )
    context = "\n---\n".join(results["documents"][0])
    # Generate a response grounded in the retrieved context
    prompt = f"""Answer using ONLY information from the context.
If not in context, say "I don't have that information".

CONTEXT:
{context}

QUESTION: {question}"""
    response = model.generate_content(prompt)
    return response.text
# Usage
index_documents("./my_documents")
answer = ask("What is the onboarding process?")
print(answer)
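One caveat with the fixed 500-character split above: it can cut a sentence in half right at a chunk boundary, so the retriever may miss answers that straddle two chunks. A common mitigation (not in the generated script) is to overlap consecutive chunks. A minimal sketch, with the `size` and `overlap` values as illustrative defaults:

```python
def chunk_with_overlap(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one."""
    chunks = []
    step = size - overlap
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks

# e.g. replace the slicing loop in load_pdfs with chunk_with_overlap(text)
```

The overlap costs some storage and a few extra embedding calls, but it means any sentence near a boundary appears whole in at least one chunk.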
Installation
pip install PyPDF2 chromadb google-generativeai
Suggested improvements
| Improvement | Description |
|---|---|
| Semantic chunks | Split by paragraphs instead of fixed 500-character windows |
| Metadata | Store the source filename with each chunk |
| Reranking | Reorder retrieved results by relevance before answering |
| Streaming | Stream the answer token by token |
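The first two improvements fit in a few lines. Packing whole paragraphs keeps sentences intact (a single paragraph longer than the limit still becomes one oversized chunk). A minimal sketch, assuming paragraphs are separated by blank lines:

```python
def paragraph_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(current) + len(para) + 2 <= max_chars:
            # paragraph still fits: append it to the current chunk
            current = f"{current}\n\n{para}".strip()
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

For the metadata improvement, ChromaDB's `collection.add` accepts a `metadatas` parameter, so `collection.add(documents=chunks, metadatas=[{"source": filename} for _ in chunks], ids=...)` attaches the source file to each chunk and lets `collection.query` filter by it later.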
Next level
→ Vector Search