I’ve been doing public and private work on Common Crawl — the open repository of web crawl data that underpins a huge amount of research and AI training.
Two specific contributions:
- cc-pyspark — Added support for file-wise processing, so jobs can be distributed over whole files rather than individual records, enabling more efficient batch operations on the crawl corpus.
- webarchive-indexing — Migrated legacy mrjob tasks to modern Spark jobs to process 9 PB+ of crawl data.
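
The idea behind file-wise processing can be sketched roughly like this: instead of parallelizing over individual records, whole file paths are assigned to workers, so each worker streams its files sequentially. This is a minimal illustrative sketch, not the actual cc-pyspark API; the function and path names are assumptions.

```python
# Hypothetical sketch of file-wise batching: distribute complete file
# paths across workers (round-robin), rather than scattering individual
# records. Each worker then streams its assigned files end to end.

def partition_files(paths, num_workers):
    """Assign each file path to a worker bucket round-robin, so every
    worker processes whole files sequentially."""
    buckets = [[] for _ in range(num_workers)]
    for i, path in enumerate(paths):
        buckets[i % num_workers].append(path)
    return buckets

# Illustrative paths only, not real Common Crawl segment names.
paths = [f"crawl-data/segment-{i}.warc.gz" for i in range(7)]
for worker_id, batch in enumerate(partition_files(paths, 3)):
    print(worker_id, batch)
```

In a Spark setting, the same effect comes from parallelizing the list of file paths itself and mapping a per-file handler over each partition, which keeps I/O sequential within a worker.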
Jason Grey