
url_crawl.py

URL Crawl to DOCX

url_crawl.py crawls a website by following same-domain links and exports the text extracted from each page into a Word document, one section per URL. Page, depth, and delay limits act as guardrails against runaway crawls.

Auditors and compliance teams regularly need a point-in-time record of published website content, such as a vendor's privacy policy, an organisation's published security standards, or a regulator's guidance page. Manual copy/paste across dozens of pages is slow, and the output is unstructured. Screenshots are not searchable. A dated Word document with one section per page is a usable, storable, citable evidence artifact.

You are scoping a third-party vendor risk assessment. The vendor's security and compliance documentation is spread across fourteen pages of their public website. You need a snapshot dated to the start of the assessment as the baseline. You run this script against the vendor's documentation section with --max-pages set to 20, and get a Word document with each page as a numbered section, all text extracted, and a table of contents. That document is date-stamped and attached to the assessment workpaper as the point-in-time baseline.

url_crawl.py starts from a given URL, follows same-domain links up to a configurable depth and page limit, extracts visible text from each page using BeautifulSoup, and writes a structured Word document using python-docx. Each page becomes a numbered section with the URL as the heading. Use --max-pages, --max-depth, and --delay to control crawl scope and be considerate of server load. Respect the target site's terms of service.
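
The traversal described above can be sketched as a breadth-first walk with a same-domain filter and two stop conditions. This is a minimal illustration, not the script's actual source: the function names, the default values, and the fetch_links callable (which stands in for the HTTP request and link extraction so the logic can run offline) are all assumptions.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain(url, start_url):
    """True when url has the same host as start_url."""
    return urlparse(url).netloc.lower() == urlparse(start_url).netloc.lower()


def crawl(start_url, fetch_links, max_pages=20, max_depth=3, delay=0.0):
    """Breadth-first crawl; returns the visited URLs in visit order.

    fetch_links(url) is a hypothetical callable returning the raw href
    values found on a page. The defaults here are assumptions.
    """
    start_url = urldefrag(start_url).url
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is recorded, but its links are not followed
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and same_domain(link, start_url):
                seen.add(link)
                queue.append((link, depth + 1))
        if delay:
            time.sleep(delay)  # pause between requests to limit server load
    return visited


if __name__ == "__main__":
    # A tiny in-memory "site" standing in for real HTTP responses.
    site = {
        "https://example.com/docs/": ["policy/", "/docs/certs/", "https://other.example.org/"],
        "https://example.com/docs/policy/": ["../", "#top"],
        "https://example.com/docs/certs/": [],
    }
    for page in crawl("https://example.com/docs/", lambda u: site.get(u, [])):
        print(page)
```

Tracking seen URLs at enqueue time keeps a page from being queued twice, and checking the page limit at the top of the loop means the crawl never visits more than max_pages pages even when many links are still waiting.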

Script

Terminal
# Install dependencies
$ pip install requests beautifulsoup4 python-docx

# Basic crawl
$ python3 url_crawl.py --start-url https://example.com --output snapshot.docx

# Limit scope
$ python3 url_crawl.py --start-url https://example.com/docs --max-pages 20 --max-depth 3 --output vendor_docs.docx
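
For readers who want to see how the flags in these commands might be declared, here is a hedged argparse sketch. The flag names come from the commands and description on this page; the defaults, the help strings, and the choice to make --output required are assumptions, since the script's real defaults are not documented here.

```python
import argparse


def build_parser():
    """Declare the flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        prog="url_crawl.py",
        description="Crawl same-domain links and export page text to a Word document.",
    )
    parser.add_argument("--start-url", required=True, help="URL where the crawl begins")
    # Treating --output as required is an assumption of this sketch.
    parser.add_argument("--output", required=True, help="path of the .docx file to write")
    # The defaults below are assumptions, not documented values.
    parser.add_argument("--max-pages", type=int, default=20, help="stop after this many pages")
    parser.add_argument("--max-depth", type=int, default=3, help="maximum link depth from the start URL")
    parser.add_argument("--delay", type=float, default=1.0, help="seconds to wait between requests")
    return parser


if __name__ == "__main__":
    # Parse the "Limit scope" command from above instead of sys.argv.
    args = build_parser().parse_args([
        "--start-url", "https://example.com/docs",
        "--max-pages", "20",
        "--max-depth", "3",
        "--output", "vendor_docs.docx",
    ])
    print(args.start_url, args.max_pages, args.max_depth, args.delay, args.output)
```

argparse converts the dashes in flag names to underscores, so --max-pages is read back as args.max_pages.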

Dependency / Permission    Why it is needed
requests                   HTTP GET for each page in the crawl
beautifulsoup4             HTML parsing and visible text extraction
python-docx                Word document assembly: one section per crawled URL
Network access             Must be able to reach the target domain

Sample output — success
  Start URL  : https://example.com/security/
  Max pages  : 20  |  Max depth : 3  |  Delay : 1s

  Crawling...
    [1/20]  /security/             → 1,204 words
    [2/20]  /security/policy/      → 3,812 words
    [3/20]  /security/certs/       → 847 words
    ...
    [14/20] /security/contact/     → 312 words

  Writing document → vendor_docs_2026-03-02.docx
  Done — 14 pages, 24,601 words, 1 document.
Sample output — error / warning
  WARNING: Page /security/legal/ returned 403 Forbidden — skipped.
  WARNING: Max pages (20) reached — crawl stopped at depth 2.
  Pages captured: 20 of estimated 35+ available.

  To capture more: increase --max-pages or run against specific sub-paths.
  Output still written → vendor_docs_partial_2026-03-02.docx
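
The two sample runs suggest a naming pattern: the crawl date is appended to the requested file name, and "_partial" is added when the page limit cuts the crawl short. The sketch below reproduces that pattern. It is inferred from the sample output only; the function names and the rule for deciding that a crawl is partial are assumptions, not the script's documented behaviour.

```python
from datetime import date
from pathlib import Path


def dated_output_name(output, when, partial=False):
    """vendor_docs.docx -> vendor_docs_2026-03-02.docx (or ..._partial_...)."""
    path = Path(output)
    marker = "_partial" if partial else ""
    return str(path.with_name(f"{path.stem}{marker}_{when.isoformat()}{path.suffix}"))


def crawl_is_partial(pages_captured, max_pages, urls_still_queued):
    """Assumed rule: the page limit was hit while links were still waiting."""
    return pages_captured >= max_pages and urls_still_queued > 0


if __name__ == "__main__":
    snapshot_day = date(2026, 3, 2)
    print(dated_output_name("vendor_docs.docx", snapshot_day))
    partial = crawl_is_partial(pages_captured=20, max_pages=20, urls_still_queued=15)
    print(dated_output_name("vendor_docs.docx", snapshot_day, partial=partial))
```

Using the ISO date format keeps snapshots of the same site sortable by file name.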

Regulation map

Framework         Control / Clause                    Obligation
GDPR / UK GDPR    Article 13/14: Privacy Notices      Controllers must maintain accessible privacy notices. Crawl snapshots provide dated evidence of published notice content.
ISO 27001:2022    A.5.31                              Legal, statutory, and contractual requirements must be documented. Website snapshots capture externally published security commitments.
PCI DSS v4.0      12.9                                Service provider compliance obligations must be documented. Crawling a vendor's compliance page captures their published assertions.
NIST CSF 2.0      GV.SC-04                            Suppliers and third parties must be assessed. Published security documentation is a key input to third-party risk assessments.
DORA (EU)         Article 28: Third-party ICT risk    Due diligence on ICT service providers requires review of their published security and resilience commitments.
SEBI CSCRF        Third-party Risk                    SEBI-regulated entities must assess vendors' security posture. Published documentation is a baseline evidence source.

Feedback welcome: send corrections, ideas, and requests to grcguy@rtapulse.com.

