scrape-to-excel

Scrape any website to a formatted Excel workbook.

Ships with a working demo against books.toscrape.com — swap two parser functions to point it at your target site.

What you get

| Feature | How |
|---|---|
| Async, concurrent | `httpx.AsyncClient` + `asyncio.Semaphore` |
| Polite | Configurable concurrency, User-Agent header, tenacity retries with exponential backoff |
| Two-pass scrape | List pages → detail pages, enriched in parallel |
| Styled output | Header row bold + colored, frozen panes, auto-sized columns |
| CLI | `scrape-to-excel --output books.xlsx --max-pages 5` |
| Containerized | `docker run scrape-to-excel --max-pages 2` |

Demo

```
$ scrape-to-excel --output books.xlsx --max-pages 2
2026-04-25 16:03:12 INFO src.scraper: page 1: 20 books
2026-04-25 16:03:13 INFO src.scraper: page 2: 20 books
2026-04-25 16:03:21 INFO src.exporter: wrote 40 rows to books.xlsx
Wrote 40 rows to books.xlsx
```

Output columns: Title · Price (£) · Rating (1-5) · Availability · Category · UPC · Description · URL.
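
The styled output described above (bold, colored header row, frozen panes, auto-sized columns) can be sketched with openpyxl. `write_styled` is a hypothetical name and the colors are illustrative, not the repository's exporter:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
from openpyxl.utils import get_column_letter

def write_styled(rows: list[dict], path: str) -> None:
    """Write rows (assumed non-empty, uniform keys) to a styled workbook."""
    wb = Workbook()
    ws = wb.active
    headers = list(rows[0].keys())
    ws.append(headers)
    for cell in ws[1]:  # style the header row
        cell.font = Font(bold=True, color="FFFFFF")
        cell.fill = PatternFill(start_color="4F81BD", end_color="4F81BD",
                                fill_type="solid")
    for row in rows:
        ws.append([row.get(h, "") for h in headers])
    ws.freeze_panes = "A2"  # keep the header visible while scrolling
    for idx, h in enumerate(headers, start=1):  # crude auto-sizing
        width = max(len(str(r.get(h, ""))) for r in rows)
        ws.column_dimensions[get_column_letter(idx)].width = max(width, len(h)) + 2
    wb.save(path)
```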

Stack

| Layer | Library |
|---|---|
| HTTP | httpx (async) |
| HTML parsing | beautifulsoup4 + lxml |
| Retries | tenacity |
| Tabular | pandas |
| Excel | openpyxl |
| CLI | click |

Quickstart

```
git clone https://github.com/DzenInCode/scrape-to-excel
cd scrape-to-excel
pip install -e .
scrape-to-excel
```

Or with Docker:

```
docker build -t scrape-to-excel .
docker run -v $(pwd):/out scrape-to-excel --output /out/books.xlsx --max-pages 2
```

Adapting for your site

Open src/scraper.py and replace two functions:

  1. _parse_list_page(html, page_url) -> list[Book] — return the list of items found on a listing page (with title / price / url)
  2. _parse_detail_page(html) -> dict — return per-item enrichment fields

The Book dataclass holds the schema; rename it (Product, Listing, Job) and add fields as needed. The async fetcher, retry logic, semaphore concurrency, and Excel exporter stay unchanged.

CLI options

```
scrape-to-excel [OPTIONS]

  -o, --output PATH       Output xlsx path  [default: books.xlsx]
  --base-url URL          Site root URL  [default: https://books.toscrape.com/]
  --max-pages INT         Max list pages to scrape  [default: 5]
  --concurrency INT       Parallel detail-page requests  [default: 5]
  --log-level [DEBUG|INFO|WARNING|ERROR]
                          [default: INFO]
  -h, --help              Show this message and exit.
```
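
The options above map onto click decorators roughly like this skeleton (a partial sketch, not the repository's entry point; the body is a placeholder):

```python
import click

@click.command(context_settings={"help_option_names": ["-h", "--help"]})
@click.option("-o", "--output", type=click.Path(), default="books.xlsx",
              show_default=True, help="Output xlsx path")
@click.option("--max-pages", type=int, default=5, show_default=True,
              help="Max list pages to scrape")
@click.option("--concurrency", type=int, default=5, show_default=True,
              help="Parallel detail-page requests")
def main(output: str, max_pages: int, concurrency: int) -> None:
    """Skeleton only: the real command would run the scraper and exporter."""
    click.echo(f"Would write to {output} "
               f"(max_pages={max_pages}, concurrency={concurrency})")
```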

Tests

```
pip install -e ".[dev]"
pytest -v
```

Limitations

  • Single-site assumption — the _parse_* functions are coupled to one site's HTML. The architecture (async fetch, retries, exporter) is reusable; the selectors are not.
  • No JavaScript rendering — for SPA sites, swap httpx for playwright and call page.content() instead of client.get(...).text.
  • Pagination is ?page=N style — for cursor-based pagination, extend scrape() to follow next links.
  • Be polite — respect robots.txt and the site's terms of service. The demo uses a site explicitly designed for scraping practice.

License

MIT — see LICENSE.
