Common Crawl Foundation
Chris Tolles, chris@commoncrawl.org
Petros Zerfos, Principal Research Scientist & Manager, IBM Research

“Enriching the Common Crawl for LLM Training using the Data Prep Kit (LF AI & Data Project)”

What is Common Crawl?
- Free archive of the public Internet since 2007
- The main training dataset behind every modern AI
- 10 Petabytes, 250+ billion pages web crawl
- Cited in over 10,000 research papers

Common Crawl is Critical to AI. Why?
- 82% of GPT-3 raw tokens came from Common Crawl
- Common Crawl is the primary training data for every LLM in production (including OpenAI, Anthropic, Mistral, IBM Granite et al.)
- Common Crawl is the only training data both large enough & freely available to everyone for a viable LLM
Not Bad Since it Wasn't Built for AI!

Citings on the Uptick

A Library Without a Catalog
- Common Crawl = a giant library of billions of web pages
- Problem: No card catalog (or Dewey System!), everything is jumbled up
- Some pages are excellent (science, education, medical)
- Others are junk (spam, boilerplate, low-quality chatter)
- Hard to find the “good books” without help
- No way to pull out a subset by category or quality level out of our 11 PB

FineWeb: A Huge, Open Web Dataset
- Introduced by Hugging Face in 2024
- Based on Common Crawl raw web data
- Preprocessed into a large-scale, deduplicated dataset
- Covers 15 trillion tokens (44 TB of text)
- Designed specifically for LLM pre-training

The FineWeb Recipe
- Deduplication: removed exact text repeats
- Language detection: tagged documents by language
- Basic filtering: low-quality and boilerplate content removed
- Open release: hosted on Hugging Face for everyone
Create a General-Purpose, Clean Web Dataset at Scale

Enter GneissWeb (IBM Research)
- Built on top of FineWeb (15T tokens of web text)
- Created a recipe to filter & categorize documents
- Used machine learning annotators to measure quali