## What you'll build
A Python web scraper that automatically extracts data from websites. It uses requests and BeautifulSoup to make HTTP requests, parse HTML, and save the results to JSON, which makes it handy for research and data collection.
## Installation
```bash
uv add requests beautifulsoup4
# or
pip install requests beautifulsoup4
```
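To confirm the install worked, here is a quick sanity check (the version numbers printed will vary):

```python
# Quick sanity check that both packages import correctly
import requests
import bs4

print(requests.__version__, bs4.__version__)
```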
## Step 1: Ask an AI for the scraper
```text
I need a web scraper in Python that:
- Uses requests and BeautifulSoup
- Extracts titles and links from a page
- Handles connection errors
- Saves results to JSON
- Has a delay between requests (be respectful)
- Includes an appropriate User-Agent
Give me the complete code.
```
## Typical code
```python
import requests
from bs4 import BeautifulSoup
import json
import time

def scrape_page(url: str) -> list[dict]:
    """Fetch a page and return the text and href of every link on it."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (educational scraper)'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    results = []
    for link in soup.find_all('a', href=True):
        results.append({
            'text': link.get_text(strip=True),
            'href': link['href']
        })
    return results

# Usage: pause between requests so you don't overload the server
data = []
for url in ['https://example.com']:
    data.extend(scrape_page(url))
    time.sleep(1)  # delay between requests, as the prompt asked

with open('results.json', 'w') as f:
    json.dump(data, f, indent=2)
```
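The prompt also asked for page titles, which the generated code above doesn't capture. A minimal sketch that adds it (the function name `scrape_title` is ours, not part of the generated code):

```python
# Sketch: grab the page <title>, which the prompt asked for but the
# code above omits. Reuses the same headers and timeout as scrape_page.
def scrape_title(url: str) -> str | None:
    headers = {'User-Agent': 'Mozilla/5.0 (educational scraper)'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.title.get_text(strip=True) if soup.title else None
```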
## Best practices
| Practice | Why |
|---|---|
| Delay between requests | Don't overload the server |
| User-Agent header | Identify yourself to site operators |
| Check robots.txt | Respect the site's crawl rules |
| Error handling | Recover gracefully from timeouts and bad responses |
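Checking robots.txt needs no extra packages; here is a minimal sketch using Python's standard library (the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Ask the site's robots.txt whether our user agent may fetch a URL
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('Mozilla/5.0 (educational scraper)', 'https://example.com/'):
    print('Allowed to scrape this page')
else:
    print('robots.txt disallows this page; skip it')
```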
## Scraping ethics
⚠️ Always check the site's terms of service. Some prohibit scraping.
## Next level
→ Public AI Chat with Auth → Chef Level