scrape-to-excel

Scrape any website to a formatted Excel workbook.

Ships with a working demo against books.toscrape.com — swap two parser functions to point it at your target site.

What you get

| Feature | How |
|---|---|
| Async, concurrent | `httpx.AsyncClient` + `asyncio.Semaphore` |
| Polite | Configurable concurrency, User-Agent header, tenacity retries with exponential backoff |
| Two-pass scrape | List pages → detail pages, enriched in parallel |
| Styled output | Header row bold + colored, frozen panes, auto-sized columns |
| CLI | `scrape-to-excel --output books.xlsx --max-pages 5` |
| Containerized | `docker run scrape-to-excel --max-pages 2` |

Demo

```
$ scrape-to-excel --output books.xlsx --max-pages 2
2026-04-25 16:03:12 INFO src.scraper: page 1: 20 books
2026-04-25 16:03:13 INFO src.scraper: page 2: 20 books
2026-04-25 16:03:21 INFO src.exporter: wrote 40 rows to books.xlsx
Wrote 40 rows to books.xlsx
```

Output columns: Title · Price (£) · Rating (1-5) · Availability · Category · UPC · Description · URL.
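
The styled output described above (bold, colored header row, frozen panes, auto-sized columns) can be sketched with openpyxl. `write_styled` is a hypothetical name and the colors are illustrative, not the repository's exporter:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
from openpyxl.utils import get_column_letter

def write_styled(rows: list[dict], path: str) -> None:
    """Write rows (assumed non-empty, uniform keys) to a styled workbook."""
    wb = Workbook()
    ws = wb.active
    headers = list(rows[0].keys())
    ws.append(headers)
    for cell in ws[1]:  # style the header row
        cell.font = Font(bold=True, color="FFFFFF")
        cell.fill = PatternFill(start_color="4F81BD", end_color="4F81BD",
                                fill_type="solid")
    for row in rows:
        ws.append([row.get(h, "") for h in headers])
    ws.freeze_panes = "A2"  # keep the header visible while scrolling
    for idx, h in enumerate(headers, start=1):  # crude auto-sizing
        width = max(len(str(r.get(h, ""))) for r in rows)
        ws.column_dimensions[get_column_letter(idx)].width = max(width, len(h)) + 2
    wb.save(path)
```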

Stack

| Layer | Library |
|---|---|
| HTTP | httpx (async) |
| HTML parsing | beautifulsoup4 + lxml |
| Retries | tenacity |
| Tabular | pandas |
| Excel | openpyxl |
| CLI | click |

Quickstart

```
git clone https://github.com/DzenInCode/scrape-to-excel
cd scrape-to-excel
pip install -e .
scrape-to-excel
```

Or with Docker:

```
docker build -t scrape-to-excel .
docker run -v $(pwd):/out scrape-to-excel --output /out/books.xlsx --max-pages 2
```

Adapting for your site

Open src/scraper.py and replace two functions:

  1. _parse_list_page(html, page_url) -> list[Book] — return the list of items found on a listing page (with title / price / url)
  2. _parse_detail_page(html) -> dict — return per-item enrichment fields

The Book dataclass holds the schema; rename it (Product, Listing, Job) and add fields as needed. The async fetcher, retry logic, semaphore concurrency, and Excel exporter stay unchanged.

CLI options

```
scrape-to-excel [OPTIONS]

  -o, --output PATH       Output xlsx path  [default: books.xlsx]
  --base-url URL          Site root URL  [default: https://books.toscrape.com/]
  --max-pages INT         Max list pages to scrape  [default: 5]
  --concurrency INT       Parallel detail-page requests  [default: 5]
  --log-level [DEBUG|INFO|WARNING|ERROR]
                          [default: INFO]
  -h, --help              Show this message and exit.
```
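
The options above map onto click decorators roughly like this skeleton (a partial sketch, not the repository's entry point; the body is a placeholder):

```python
import click

@click.command(context_settings={"help_option_names": ["-h", "--help"]})
@click.option("-o", "--output", type=click.Path(), default="books.xlsx",
              show_default=True, help="Output xlsx path")
@click.option("--max-pages", type=int, default=5, show_default=True,
              help="Max list pages to scrape")
@click.option("--concurrency", type=int, default=5, show_default=True,
              help="Parallel detail-page requests")
def main(output: str, max_pages: int, concurrency: int) -> None:
    """Skeleton only: the real command would run the scraper and exporter."""
    click.echo(f"Would write to {output} "
               f"(max_pages={max_pages}, concurrency={concurrency})")
```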

Tests

```
pip install -e ".[dev]"
pytest -v
```

Limitations

  • Single-site assumption — the _parse_* functions are coupled to one site's HTML. The architecture (async fetch, retries, exporter) is reusable; the selectors are not.
  • No JavaScript rendering — for SPA sites, swap httpx for playwright and call page.content() instead of client.get(...).text.
  • Pagination is ?page=N style — for cursor-based pagination, extend scrape() to follow next links.
  • Be polite — respect robots.txt and the site's terms of service. The demo uses a site explicitly designed for scraping practice.

License

MIT — see LICENSE.
