Scrape any website to a formatted Excel workbook.
Ships with a working demo against books.toscrape.com — swap two parser functions to point it at your target site.
| Feature | How |
|---|---|
| Async, concurrent | `httpx.AsyncClient` + `asyncio.Semaphore` |
| Polite | Configurable concurrency, User-Agent header, tenacity retries with exponential backoff |
| Two-pass scrape | List pages → detail pages, enriched in parallel |
| Styled output | Header row bold + colored, frozen panes, auto-sized columns |
| CLI | `scrape-to-excel --output books.xlsx --max-pages 5` |
| Containerized | `docker run scrape-to-excel --max-pages 2` |
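The concurrency pattern from the table can be sketched as follows. This is illustrative, not the project's actual API: `fetch_all` and the URLs are made up, and an `asyncio.sleep` stands in for the real `httpx.AsyncClient.get` call so the sketch runs offline.

```python
import asyncio

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    # In the real scraper this would be `await client.get(url)` on a shared
    # httpx.AsyncClient; a short sleep stands in for network I/O here.
    async with sem:  # at most `concurrency` requests in flight at once
        await asyncio.sleep(0.01)
        return f"<html>{url}</html>"

async def fetch_all(urls: list[str], concurrency: int = 5) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    # gather() preserves input order even though requests overlap
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

pages = asyncio.run(fetch_all([f"https://example.com/page-{i}" for i in range(1, 4)]))
```

The semaphore is what keeps the scraper polite: raising `--concurrency` widens the window, lowering it serializes requests.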
```
$ scrape-to-excel --output books.xlsx --max-pages 2
2026-04-25 16:03:12 INFO src.scraper: page 1: 20 books
2026-04-25 16:03:13 INFO src.scraper: page 2: 20 books
2026-04-25 16:03:21 INFO src.exporter: wrote 40 rows to books.xlsx
Wrote 40 rows to books.xlsx
```
Output columns: Title · Price (£) · Rating (1-5) · Availability · Category · UPC · Description · URL.
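The styling described above (bold colored header, frozen panes, sized columns) can be reproduced with openpyxl roughly like this. The fill color, widths, and the sample row are illustrative, not the exporter's exact values:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
from openpyxl.utils import get_column_letter

wb = Workbook()
ws = wb.active
header = ["Title", "Price (£)", "Rating (1-5)", "Availability",
          "Category", "UPC", "Description", "URL"]
ws.append(header)
ws.append(["A Light in the Attic", 51.77, 3, "In stock",
           "Poetry", "a897fe39b1053632", "...", "https://books.toscrape.com/"])

bold = Font(bold=True, color="FFFFFF")
fill = PatternFill("solid", start_color="4F81BD")  # header background
for cell in ws[1]:
    cell.font = bold
    cell.fill = fill

ws.freeze_panes = "A2"  # keep the header row visible while scrolling
for i, name in enumerate(header, start=1):
    # crude auto-size: width derived from the header text plus padding
    ws.column_dimensions[get_column_letter(i)].width = len(name) + 4

wb.save("demo.xlsx")  # writes a small demo workbook in the current directory
```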
| Layer | Library |
|---|---|
| HTTP | httpx (async) |
| HTML parsing | beautifulsoup4 + lxml |
| Retries | tenacity |
| Tabular | pandas |
| Excel | openpyxl |
| CLI | click |
```
git clone https://github.com/DzenInCode/scrape-to-excel
cd scrape-to-excel
pip install -e .
scrape-to-excel
```

Or with Docker:

```
docker build -t scrape-to-excel .
docker run -v $(pwd):/out scrape-to-excel --output /out/books.xlsx --max-pages 2
```

Open `src/scraper.py` and replace two functions:

- `_parse_list_page(html, page_url) -> list[Book]` — return the list of items found on a listing page (with title / price / url)
- `_parse_detail_page(html) -> dict` — return per-item enrichment fields
The `Book` dataclass holds the schema; rename it (`Product`, `Listing`, `Job`) and add fields as needed. The async fetcher, retry logic, semaphore concurrency, and Excel exporter stay unchanged.
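A minimal `_parse_list_page` might look like the sketch below. The `Book` fields shown are a simplification of the project's dataclass, and the inline HTML is a trimmed imitation of the demo site's listing markup, so this is a shape to copy rather than the shipped implementation:

```python
from dataclasses import dataclass
from bs4 import BeautifulSoup

@dataclass
class Book:
    title: str
    price: float
    url: str

# Inline sample imitating a books.toscrape.com listing card.
SAMPLE = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

def _parse_list_page(html: str, page_url: str) -> list[Book]:
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        price_text = card.select_one("p.price_color").get_text(strip=True)
        books.append(Book(
            title=link["title"],                  # full title lives in the attribute
            price=float(price_text.lstrip("£")),  # "£51.77" -> 51.77
            url=page_url + link["href"],          # hrefs are relative to the page
        ))
    return books

books = _parse_list_page(SAMPLE, "https://books.toscrape.com/")
```

For a different site, only the CSS selectors and field extraction inside the loop change.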
```
scrape-to-excel [OPTIONS]

  -o, --output PATH    Output xlsx path                [default: books.xlsx]
  --base-url URL       Site root URL                   [default: https://books.toscrape.com/]
  --max-pages INT      Max list pages to scrape        [default: 5]
  --concurrency INT    Parallel detail-page requests   [default: 5]
  --log-level [DEBUG|INFO|WARNING|ERROR]
                                                       [default: INFO]
  -h, --help           Show this message and exit.
```
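A click command reproducing these options could be declared as below. This is a sketch, not the project's entry point: the body just echoes its arguments where the real command would run the scraper, and `CliRunner` drives it in-process the way `scrape-to-excel --max-pages 2` would from a shell.

```python
import click
from click.testing import CliRunner

@click.command(context_settings={"help_option_names": ["-h", "--help"]})
@click.option("-o", "--output", default="books.xlsx", show_default=True,
              help="Output xlsx path")
@click.option("--base-url", default="https://books.toscrape.com/",
              show_default=True, help="Site root URL")
@click.option("--max-pages", type=int, default=5, show_default=True,
              help="Max list pages to scrape")
@click.option("--concurrency", type=int, default=5, show_default=True,
              help="Parallel detail-page requests")
@click.option("--log-level", default="INFO", show_default=True,
              type=click.Choice(["DEBUG", "INFO", "WARNING", "ERROR"]))
def main(output, base_url, max_pages, concurrency, log_level):
    """Scrape a site and export the result to an Excel workbook."""
    click.echo(f"would scrape {max_pages} page(s) from {base_url} into {output}")

# Drive the command in-process, as `scrape-to-excel --max-pages 2` would.
result = CliRunner().invoke(main, ["--max-pages", "2"])
```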
```
pip install -e ".[dev]"
pytest -v
```

- Single-site assumption — `_parse_*` functions are coupled to one site's HTML. The architecture (async fetch, retries, exporter) is reusable; the selectors are not.
- No JavaScript rendering — for SPA sites, swap `httpx` for `playwright` and call `page.content()` instead of `client.get(...).text`.
- Pagination is `?page=N` style — for cursor-based pagination, extend `scrape()` to follow `next` links.
- Be polite — respect `robots.txt` and the site's terms of service. The demo uses a site explicitly designed for scraping practice.
MIT — see LICENSE.
