I’ve been doing public and private work on Common Crawl — the open repository of web crawl data that underpins a huge amount of research and AI training.

Two specific contributions:

  • cc-pyspark — Added support for file-wise processing, enabling more efficient batch operations on the crawl corpus.
  • webarchive-indexing — Migrated legacy mrjob tasks to modern Spark jobs to process 9 PB+ of crawl data.
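
The file-wise pattern in the first bullet can be sketched in plain Python. To be clear, this is a hypothetical illustration, not cc-pyspark's actual API: the paths and the `process_file` function are made up. The idea is that each task receives a whole WARC file path and iterates it locally, rather than the framework splitting the corpus into individual records up front.

```python
# Hypothetical listing of WARC paths; in practice these come from a
# crawl's warc.paths file. (Illustrative names, not real crawl paths.)
warc_paths = [
    "crawl-data/CC-MAIN-2024/warc/part-000.warc.gz",
    "crawl-data/CC-MAIN-2024/warc/part-001.warc.gz",
]

def process_file(path):
    # A real job would open the (possibly remote) WARC here and iterate
    # its records; we return a (path, record_count) stand-in instead.
    return (path, 0)

# File-wise processing: one task per file, not one task per record.
# In Spark this shape becomes sc.parallelize(warc_paths).map(process_file),
# so each executor fetches and scans its own files independently.
results = list(map(process_file, warc_paths))
```

Processing per file keeps each task's I/O sequential and lets the scheduler balance work by assigning whole files, which is usually the efficient granularity for gzip-compressed WARC archives that cannot be split mid-stream anyway.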