OpenCatalog, curated by FLOSSK
AI & Machine Learning

Datasets

Hugging Face library for loading and sharing large datasets, with memory-mapped Apache Arrow storage, streaming, and Hugging Face Hub integration.

Why it is included

Foundational open-source software for loading NLP and LLM training data reproducibly and at scale.

Best for

Anyone fine-tuning or evaluating models on multi-terabyte corpora who wants to avoid writing custom loaders.

Strengths

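  • Streaming: iterate over a dataset without downloading it in full
  • Cache: processed results are stored on disk and reused
  • Hub interoperability: load from and publish to the Hugging Face Hub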

Limitations

  • Highly custom data formats may still need bespoke preprocessing

Good alternatives

WebDataset · Petastorm · tf.data
