A profile of nonprofit Common Crawl, which has scraped billions of webpages since 2013, including paywalled ones, to build an archive used by OpenAI and others
Editor's note: This work is part of AI Watchdog, The Atlantic's ongoing investigation into the generative-AI industry.

X:

Kaitlyn Tiffany / @kait_tiffany : incredible story by @_alexreisner! i'm harrowed https://www.theatlantic.com/ ...

Bluesky:

Katie Notopoulos / @katienotopoulos : This article is really great bc it's about an org everyone knows but seems so boring we forget about it (Common Crawl) and turns out they're doing something extremely shady, AND the guy in charge keeps saying the most damning things to a reporter www.theatlantic.com/technology/ 2...

Damon Beres / @damonberes.com : NEW: Common Crawl, the massive archiver of the web, has gotten cozy with AI companies and is providing paywalled articles for training data. They're also lying to publishers who have asked for material to be removed. “The robots are people too,” CC's exec director told us when we asked about this.

Justin Hendrix / @justinhendrix : “In the process ... Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this—as well as masking the actual contents of its archives.” [embedded post]

Damon Beres / @damonberes.com : Common Crawl says it complies with removal requests—while telling us they are “a pain in the ass”—but also is not actually removing the data in question. [images]

Damon Beres / @damonberes.com : “We can't police that whole thing,” Common Crawl said. “It's not our job. We're just a bunch of dusty bookshelves.” Meanwhile, CC has accepted hundreds of thousands in donations from AI companies such as OpenAI and Anthropic. And it expressed open antagonism toward the media: [image]