elegant_wallaby

They removed content using the instances we found as well as an apparent expanded MD5 hash set, chaffing the removals by removing a large amount of other content, including using keywords that could relate to CSAM and non-CSAM that could contain sensitive info about kids. 2/15

2024-08-31

TechCrunch

LAION, a research org whose dataset was used to train Stable Diffusion and other models, releases a new dataset it claims has been “thoroughly cleaned” of CSAM

LAION, the German research org that created the data used to train Stable Diffusion, among other generative AI models …

LAION has released a revised version of the LAION-5B dataset to address CSAM concerns we previously highlighted. Here are my impressions. 1/15 https://laion.ai/...

2024-08-31

These are obviously all significant improvements; credit to LAION and the other involved child safety orgs for all their work. It's not quite what I would call a gold standard, but it definitely sets a much better example. 4/15

2024-08-31

They also removed all content above a conservative “unsafe” probability; one dataset removing >0.95, which we found covered almost all our matches. The other dataset is far more conservative, removing the majority of NSFW samples. 3/15

2024-08-31

The NCII part is key IMO, because a *lot* of imagery that shows up in random crawls is non-consensual, of dubious provenance or at the very least copyrighted. And also just private imagery, and risky given identity and age ambiguity. 6/15

2024-08-31

We used a combination of methods to determine this: perceptual hashing, cryptographic hashing, and k-nearest neighbors analysis using the image embeddings. Seeded from a small subset of the dataset, PhotoDNA identified hundreds of instances, the URLs of which were reported to NCMEC.

2023-12-21

Bloomberg

Stanford researchers: LAION-5B, a dataset of 5B+ images used by Stability AI and others, contains 1,008+ instances of CSAM, possibly helping AI to generate CSAM

most prominently, Stable Diffusion 1.5—to see to what degree CSAM itself might be present in the training data. https://purl.stanford.edu/... Alex Stamos / @alex.stamos : Lots of p...

I'm not sure what the legal implications are for this; most CSAM possession laws were made with the assumption that only huge service providers would have this much storage of mixed data, and they generally have detection and reporting flows. But all LAION-5B images can fit in a backpack.

2023-12-21

Fixing this problem is going to be difficult. The datasets are already out there, and the models are already trained. While we've made good progress in getting content removed from the source URLs, removing it from public datasets gives people a map to CSAM and its associated image embeddings.

2023-12-21

As a follow-up to our work on computer-generated CSAM, we took a closer look at the training data used to train various generative models—most prominently, Stable Diffusion 1.5—to see to what degree CSAM itself might be present in the training data. https://purl.stanford.edu/...

2023-12-21

“certain discrepancies have emerged in the material used” The passive voice doing some heavy lifting here https://twitter.com/...

2022-10-24

The Wire

The Wire retracts two recent stories about Meta's XCheck program and says it is using “independent external experts” to investigate its coverage

Also note that every other email that people have presented — and every email I've received from fb.com going back to 2016 — is formatted differently from what the video shows. The header list is invariably lowercase and with padding around the colons. 1/2 https://twitter.com/... https://twitter.com/...

2022-10-16

The Wire

In response to Meta's rebuttal of its XCheck report, The Wire shares a video of a source using a subdomain, DKIM signatures, and more, but experts are skeptical

& many mainstream foreign journalists also questioned The Wire's work. Now, @thewire_in says it's verified the email via- its DKIM signature. https://thewire.in/... Matthew Green /...

You have no assurance that the Facebook you're getting is the same as other users—in fact you're guaranteed it *isn't*, given A/B experiments and regional issues. There's no meaningful way to audit it and ensure that it hasn't been altered to target you in some way. 6/

2021-12-04

Wired

Before implementing e2ee, Meta must improve its existing content-oblivious harm-reduction mechanisms, limit recommendation engines and discoverability, and more

in fact you're guaranteed it *isn't*, given A/B experiments and regional issues. There's no meaningful way to audit it and ensure that it hasn't been altered to target you in some ...

ripping messaging out of websites entirely, and relying on purpose-built messaging apps the same way we do with phones and addresses. It's not entirely satisfying or entirely convenient, but IMO the reduced complexity and attack surface is worth it. 13/13

2021-12-04

Some thoughts on the complexities that bogged down @Meta's E2EE efforts, and hopefully some hints at a way forward: https://www.wired.com/...

2021-12-04

WhatsApp doesn't recommend people to befriend and interact with. It doesn't host secret groups of unlimited size. It doesn't provide global search of every user. It doesn't group people by location or institutions like high schools. 9/

2021-11-24

@elegant_wallaby

[Thread] A former Facebook employee says Meta announced an “absurdly accelerated timeline” for e2ee messaging to preempt antitrust action and generate good PR

David Thiel / @elegant_wallaby :

Has “but the children” been an excuse for all kinds of terrible ideas and government overreach? Absolutely. And government will indeed use it to try to hamper E2EE. But that doesn't mean that real child safety concerns are imaginary or minimal. 14/

2021-11-23

@elegant_wallaby