🕷️

Basic Web Scraper

๐Ÿง‘โ€๐Ÿณ Cookโฑ๏ธ 25 minutes

📋 Suggested prerequisites

  • Basic Python
  • Basic HTML

What you'll build

A Python web scraper that automatically extracts data from websites. It uses requests and BeautifulSoup to make HTTP requests, parse the HTML, and save the results to JSON. Useful for research and data collection.


Installation

uv add requests beautifulsoup4
# or
pip install requests beautifulsoup4

Step 1: Ask an AI for the scraper

I need a web scraper in Python that:
- Uses requests and BeautifulSoup
- Extracts titles and links from a page
- Handles connection errors
- Saves results to JSON
- Has delay between requests (be respectful)
- Includes appropriate user-agent

Give me the complete code.

Typical code

import requests
from bs4 import BeautifulSoup
import json
import time

def scrape_page(url: str) -> list[dict]:
    headers = {
        'User-Agent': 'Mozilla/5.0 (educational scraper)'
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')

    results = []
    for link in soup.find_all('a', href=True):
        results.append({
            'text': link.get_text(strip=True),
            'href': link['href']
        })

    return results

# Usage: fetch a couple of pages with a pause between requests
urls = ['https://example.com', 'https://example.org']
data = []
for url in urls:
    data.extend(scrape_page(url))
    time.sleep(1)  # be respectful: delay between requests

with open('results.json', 'w') as f:
    json.dump(data, f, indent=2)
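BeautifulSoup does the parsing above, but the same link extraction can be sketched with only the standard library's html.parser. This is an illustration of what find_all('a', href=True) is doing, not library code: the LinkExtractor class below is our own.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects {'text', 'href'} dicts from <a> tags, stdlib only."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are inside, if any
        self._text = []     # text fragments collected inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"text": "".join(self._text).strip(),
                               "href": self._href})
            self._href = None

parser = LinkExtractor()
parser.feed('<p><a href="/docs">Docs</a> and <a href="/blog">Blog</a></p>')
print(parser.links)
# [{'text': 'Docs', 'href': '/docs'}, {'text': 'Blog', 'href': '/blog'}]
```

For real scraping, BeautifulSoup is still the better choice: it tolerates malformed HTML and gives you search helpers like find_all and select.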

Best practices

  • Delay between requests: don't overload the server
  • User-Agent: identify yourself to the site
  • robots.txt: respect the site's crawling rules
  • Error handling: deal with timeouts and failed requests
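The robots.txt rule deserves a concrete example. Python ships urllib.robotparser for this; the sketch below feeds it made-up inline rules so it runs offline (a real scraper would call rp.set_url(...) and rp.read()), and the user-agent string "EducationalScraper" is just an example name:

```python
from urllib import robotparser

# Hypothetical robots.txt contents; in practice you would fetch the
# real file with rp.set_url('https://example.com/robots.txt'); rp.read()
rules = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("EducationalScraper", "https://example.com/private/data"))  # False
print(rp.can_fetch("EducationalScraper", "https://example.com/public"))        # True
print(rp.crawl_delay("EducationalScraper"))  # 2 (seconds between requests)
```

Check can_fetch() before each request, and honor crawl_delay() when the site declares one instead of your own default.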

Scraping ethics

โš ๏ธ Always check the site's terms of service. Some prohibit scraping.


Next level

→ Public AI Chat with Auth — Chef Level