Heritrix
Extensible, web-scale, archival-quality crawler produced by the Internet Archive for capturing sites into WARC files.
Why it is included
The classic open-source crawler behind broad web preservation programs and many pywb ingest pipelines.
Best for
Organizations running scoped crawls with politeness rules and WARC output requirements.
Strengths
- Mature crawl engine
- WARC focus
- Broad adoption
Limitations
- Needs JVM tuning; training in legal and robots.txt ethics is mandatory
Good alternatives
Browsertrix · wget + warc-tools · commercial crawlers
Related tools
Archiving & digital preservation
pywb
High-performance Python web archive replay stack (WARC) used by Webrecorder and many institutions for Wayback-style access.
Archiving & digital preservation
Archivematica
End-to-end digital preservation workflow: ingest, virus scan, normalization, METS/PREMIS metadata, AIP storage, and DIP access packages.
Archiving & digital preservation
Omeka S
Multisite web platform for scholarly and cultural collections: linked open data, resource templates, modules, and IIIF-friendly patterns.
Archiving & digital preservation
Omeka Classic
PHP/MySQL platform for online collections and exhibits: simple Dublin Core item metadata, themes, and a plugin ecosystem.
Archiving & digital preservation
AtoM (Access to Memory)
Web-based archival description application aligned with ISAD(G), ISAAR(CPF), RAD, and DACS, supporting multi-level finding aids and authority records.
Archiving & digital preservation
ArchivesSpace
Archival information management for accessioning, description, locations, agents, and public discovery interfaces.
