Chronicles

The story behind the story

A profile of nonprofit Common Crawl, which has scraped billions of webpages since 2013, including paywalled ones, to build an archive used by OpenAI and others

Editor's note: This work is part of AI Watchdog, The Atlantic's ongoing investigation into the generative-AI industry.

X:

Kaitlyn Tiffany / @kait_tiffany: incredible story by @_alexreisner! i'm harrowed https://www.theatlantic.com/ ...

Bluesky:

Katie Notopoulos / @katienotopoulos: This article is really great bc it's about an org everyone knows but seems so boring we forget about it (Common Crawl) and turns out they're doing something extremely shady, AND the guy in charge keeps saying the most damning things to a reporter www.theatlantic.com/technology/ 2...

Damon Beres / @damonberes.com: NEW: Common Crawl, the massive archiver of the web, has gotten cozy with AI companies and is providing paywalled articles for training data. They're also lying to publishers who have asked for material to be removed. “The robots are people too,” CC's exec director told us when we asked about this.

Justin Hendrix / @justinhendrix: “In the process ... Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this—as well as masking the actual contents of its archives.” [embedded post]

Damon Beres / @damonberes.com: Common Crawl says it complies with removal requests—while telling us they are “a pain in the ass”—but also is not actually removing the data in question. [images]

Damon Beres / @damonberes.com: “We can't police that whole thing,” Common Crawl said. “It's not our job. We're just a bunch of dusty bookshelves.” — Meanwhile, CC has accepted hundreds of thousands in donations from AI companies such as OpenAI and Anthropic. And it expressed open antagonism toward the media: [image]

The Atlantic