priva.cat • 87 days ago

As many of you following main might know, @hf.co released 1M posts scraped from the @bsky.app Firehose API. It was a mess. People got angry, and I thought -- #privacydisaster.

So I decided to write about it. https://insights.priva.cat/p/privacy-disasters-facehuggers-are

Let's unpack this a bit... 👇

Comments

morselya.bsky.social•86 days ago

For when I need to review my own privacy risk exposure...

📌

daraghobrien.bsky.social•87 days ago

So, I sent in my MOADSAR (mother of all DSARs) last week. Unlike others I didn’t limit my scope just to the two Bluesky datasets I knew about, but any uploaded dataset. Including datasets scraped from other platforms.
I also included an Article 18 restriction of processing request.

daraghobrien.bsky.social•87 days ago

So, I am expecting to spend a few days in early January crafting my complaint to DPC, which will then have to go via EDPB to the CNIL.

daraghobrien.bsky.social•87 days ago

But before then, this whole scenario will be featuring in a webinar I’m doing (and in the university course I’m teaching in January)

handle.invalid•86 days ago

I envy your students!

priva.cat•87 days ago

Feel free to use any of my snark in that. I need people's affirmation at my humor.

daraghobrien.bsky.social•87 days ago

I will be, and citing my source. My session was already going to be snarky. This just makes it more ‘narrative’

taraba.bsky.social•86 days ago

How would you compare this case with LinkedIn’s data scraping ?

handle.invalid•86 days ago

I don't quite agree with what you said about Bluesky also fucking over its users; I don't see what they can do about it, given the open nature of ATproto. HF, on the other hand, needs to take a good hard look at their own rules (because AlpinDale's dataset violates all of them) and act on that.

handle.invalid•86 days ago

Also noticed I already replied but consider this a 2nd bit of opinion :D And a reminder I need to track my replies better... sorry! :D

priva.cat•86 days ago

Bluesky is not fucking over its users, so much as not enforcing its own policies.

Many people have reported @alpindale.bsky.social and others, and Bsky basically gave them a pass, despite this very clearly violating their policies. I'm mostly salty about that. And technical sloppiness.

handle.invalid•86 days ago

They did ban him - that's about all they can do. Technical sloppiness, you'll have to explain that one. ATproto is designed to be open, the firehose being one notable byproduct - and while yes, technically it would be possible to lock it down, it will never be impossible to abuse it regardless. 1/2

handle.invalid•86 days ago

It's much like hacking; nothing is unhackable, you just need to raise the bar high enough that threat actors look for easier targets. But often raising the bar high enough means extra hoop-jumping for developers.

priva.cat•86 days ago

Oh, I know nothing is unhackable. One thing they could do to raise the bar, for example, is require account authentication / API keys that can be easily revoked. Rate limits might also work.

Or something closer to what Mastodon does, where users can toggle public sharing.
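
To make the API key and rate limit idea concrete, here is a minimal sketch of a gate that could sit in front of a stream endpoint. Everything in it is hypothetical: `FirehoseGate`, `TokenBucket`, and the limits are made-up names and numbers for illustration, not anything Bluesky or ATproto actually implements.

```python
import time


class TokenBucket:
    """Simple token bucket: holds up to `capacity` tokens, refilled over time."""

    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Top up tokens for the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class FirehoseGate:
    """Hypothetical gate: requests need an issued, non-revoked key and a free token."""

    def __init__(self, capacity=5, refill_per_sec=1.0):
        self._capacity = capacity
        self._refill = refill_per_sec
        self._buckets = {}
        self._revoked = set()

    def issue(self, key, now=None):
        now = time.monotonic() if now is None else now
        self._buckets[key] = TokenBucket(self._capacity, self._refill, now)

    def revoke(self, key):
        # Revocation takes effect on the very next request.
        self._revoked.add(key)

    def check(self, key, now=None):
        now = time.monotonic() if now is None else now
        if key in self._revoked or key not in self._buckets:
            return "denied"
        return "ok" if self._buckets[key].allow(now) else "rate_limited"
```

A scraper holding a key would start getting `"rate_limited"` once it exceeds its budget and `"denied"` as soon as the key is revoked. This does not make abuse impossible; it only gives moderators a lever that an unauthenticated stream lacks.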

handle.invalid•86 days ago

Problem with that toggle is that if you don't share on the firehose you may as well close your Bluesky account. I guess you could do something in the PDS, a sort of "these NSIDs can be fetched from the following sources" list, but that'd put self-hosting very much straight into the hands of the 1/3

priva.cat•86 days ago

They rescinded the ban. That's what he was bragging about.

handle.invalid•86 days ago

No way... okay so, that does indeed tick me off too. If they banned him he oughta stay banned. Otherwise it just looks like pacifying the angry masses into putting their torches and pitchforks away...

priva.cat•86 days ago

https://bsky.app/profile/alpindale.bsky.social/post/3lbzu6bfekc2i

priva.cat•86 days ago

Yes.

handle.invalid•87 days ago

Good 'un - the thing is, Bluesky can't technically do much at the moment due to how the firehose operates, and changing that isn't a one-day-done-and-dusted affair. It could be done, and probably should be done. HF on the other hand, regardless of their "rules" 1/2

handle.invalid•87 days ago

2/2 knows damn well consent was not obtained, a look at how their staff has responded thus far makes that painfully obvious.

kdw.bsky.social•86 days ago

the thing is that Bluesky advertised that their data can be used by third parties here https://arxiv.org/abs/2402.03239

priva.cat•86 days ago

Yes, the general use by third parties is also in their privacy statement, and I conceded that early on.

The issue isn't can, the issue is should, and does that extend to absolutely anything and everything?
Lots of "public" things still have use limitations!

kdw.bsky.social•86 days ago

yes, I know you know! my reply was meant less for you than for the person who replied to you saying something like “they haven’t got around to implementing it yet…”. No! this is how it was intentionally built, it’s a feature not a bug and the HF use is totally in line with that - the BS response was disingenuous….

handle.invalid•86 days ago

From a legal standpoint, HF's use was not in line with that - not according to their own TOS, and definitely not according to GDPR (heck, HF itself isn't GDPR compliant even though they should be, being an EU company).

kdw.bsky.social•86 days ago

the only way anything will be changed or stopped is if the EU says this violates GDPR….(not exactly loving how replies work here, sorry if you thought it was addressed at you). good thread!

priva.cat•87 days ago

Agree on the technical aspects. I am salty about Bluesky mod team not kicking Alpindale & other accounts off for doing this though, despite his use being a very clear Bsky policy violation. https://bsky.app/profile/alpindale.bsky.social/post/3lbzu6bfekc2i

priva.cat•87 days ago

Bluesky uses the #ATProto protocol for a federated & decentralized network, promising user autonomy. They are also quite transparent that posts and blocks are public.

But when public posts are mined for AI datasets without consent, things get murky. Is "public" really fair game for AI training?

priva.cat•87 days ago

@hf.co's dataset included user DIDs, which are persistent IDs tied to accounts. Are they public? Sure. But they're also identifiable. And this creates some problems #legally speaking.

Public for communicating on a social network is one thing -- but should that translate to #AI training fodder?
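
A small sketch of why a persistent ID is identifiable even when it is public. The records and field names (`did`, `handle`, `text`) are hypothetical stand-ins, not the actual schema of the released dataset; the only assumption taken from the above is that each post carries the author's DID.

```python
from collections import defaultdict

# Hypothetical records: the DID stays the same even if the handle changes.
posts = [
    {"did": "did:plc:exampleaaaa", "handle": "alice.example.com", "text": "first post"},
    {"did": "did:plc:examplebbbb", "handle": "bob.example.com", "text": "hello"},
    {"did": "did:plc:exampleaaaa", "handle": "anon-cat.example.com", "text": "posting under a new handle"},
]


def link_by_did(records):
    """Group records by DID, linking everything one account has posted."""
    linked = defaultdict(list)
    for record in records:
        linked[record["did"]].append(record)
    return dict(linked)


linked = link_by_did(posts)

# Both handles resolve to the same account because they share one DID.
handles = sorted({r["handle"] for r in linked["did:plc:exampleaaaa"]})
print(handles)
```

Stripping or changing handles does not help here: anyone holding the dataset can still join every post back to one account through the DID.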

priva.cat•87 days ago

In short, it's about #context.

Sharing skeets on Bluesky ≠ consenting to AI scraping. Imagine you're at a party and a game of Truth or Dare leads to you singing (badly) in front of everyone. You might consent to amusing the guests, but what if your off-key performance of 'Ken Lee' ends up on YouTube?

teflondub.bsky.social•87 days ago

A timely reminder that consent should be enthusiastic, continuous and may be withdrawn at any time.

katekaye.bsky.social•87 days ago

enthusiastic could be a stretch...

teflondub.bsky.social•87 days ago

In this context, sure! Maybe intentional?

handle.invalid•86 days ago

Depends on what I'm consenting to really. If it involves happy fun time with my SO then hell yes I'm absolutely enthusiastically consenting. If it's consent to allow my dentist to work on my teeth it's... not all that enthusiastic. If it's training an LLM, then it's oh hell the fuck no :D

europaulb.seriousprivacy.eu•87 days ago

Truth or dare leading to karaoke? I guess something much more lewd is more likely… 🙃

priva.cat•87 days ago

I was trying to go for a not-entirely-scandalous example, and I thought of the hilarious Ken Lee video from years ago. My original example involved people with strange toenail fetishes and weird bumps on private bits though, but then I worried people would @ me in the comments :D

handle.invalid•86 days ago

So I have this weird bump...

priva.cat•87 days ago

Instead of being mocked by a few drunk party-goers, you're now the butt of jokes for millions. _That's_ what people were angry about when they protested the use of their skeets.

It's about context. Nobody expected that Bluesky's firehose data would be weaponized for AI datasets.

priva.cat•87 days ago

What adds insult to injury is the response by @moderation.bsky.app & leadership, and @hf.co, to the half dozen other assholes posting even _larger_ datasets of posts after the batch of 1M posts was pulled.

In short, they did nothing.

https://huggingface.co/datasets?search=bluesky%20posts

priva.cat•87 days ago

The lack of #consent or #transparency in Bluesky/Hugging Face's data handling highlights a deeper problem: tech promises decentralization but stumbles into the same traps as big platforms. 🚩

We skeet for conversations, not AI training. Misusing this breaches privacy norms & legal principles.

blkgrlhistorian.bsky.social•87 days ago

📌

privacymatters.bsky.social•87 days ago

📌
