Heritrix

Extensible, web-scale, archival-quality crawler produced by the Internet Archive for capturing sites into WARC files.

Why it is included

The classic open-source crawler behind broad web preservation programs; the WARC files it produces feed many pywb replay pipelines.

Best suited to organizations running scoped crawls that need politeness rules and WARC output.

Alternatives: Browsertrix · wget + warc-tools · commercial crawlers