ryanjgallag.com • 91 days ago

In the HuggingFace/Bluesky incident, the problem goes deeper than whether the data is "public" or "private"

What matters to people is whether their data was collected, which data was collected, how it may be used, and who it may be used by

Comments

cv-danes.com•91 days ago

Unless specific guardrails are in place on the site, the current [purposely] lax US stance on data privacy means that you should assume anything you post will be harvested and misused. So post accordingly.

ryanjgallag.com•91 days ago

1) Whose data was collected?

In theory, anyone's. Bluesky is designed to make everyone's data accessible. You don't even need to authenticate to grab the data from an API!
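To make the "no authentication" point concrete, here is a minimal sketch. It assumes Bluesky's public read-only AppView host (`public.api.bsky.app`) and the `app.bsky.feed.getAuthorFeed` query; the helper names and the placeholder handle are illustrative and not from the thread.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Bluesky's public AppView host, which serves read-only XRPC queries.
PUBLIC_APPVIEW = "https://public.api.bsky.app"


def author_feed_url(actor: str, limit: int = 5) -> str:
    """Build the URL for an app.bsky.feed.getAuthorFeed query (hypothetical helper)."""
    query = urllib.parse.urlencode({"actor": actor, "limit": limit})
    return f"{PUBLIC_APPVIEW}/xrpc/app.bsky.feed.getAuthorFeed?{query}"


def fetch_author_feed(actor: str, limit: int = 5) -> dict:
    """Fetch an account's recent posts. No token or login header is sent."""
    with urllib.request.urlopen(author_feed_url(actor, limit), timeout=10) as resp:
        return json.load(resp)


# "example.bsky.social" is a placeholder handle, not a real account reference.
print(author_feed_url("example.bsky.social", limit=3))
```

The request is a plain HTTPS GET with no credentials attached, which is the whole point being made here.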

ryanjgallag.com•91 days ago

While Bluesky is very transparent about how it's built this way intentionally, not everyone realizes this means nearly all user data is easily accessible by pretty much everyone

It's a major shortcoming that the AT protocol does not currently accommodate any sense of "private" data

ryanjgallag.com•91 days ago

2) What data was collected?

Post and account data was hosted *directly* on HuggingFace

Best practice for sharing social media data publicly is to post the IDs of posts and accounts, rather than the posts and accounts themselves

Then, researchers can use these IDs to "rehydrate" the data themselves
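A rough sketch of what rehydration could look like, assuming the shared dataset is just a list of post AT URIs and that posts are looked up through the `app.bsky.feed.getPosts` query on the public AppView. The function names, the batch-size constant, and the injectable `fetch` argument are my own illustrative choices, not anything HuggingFace or Bluesky prescribes.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable, List

# Assumption: Bluesky's public read-only AppView host.
PUBLIC_APPVIEW = "https://public.api.bsky.app"
# To my understanding, app.bsky.feed.getPosts accepts at most 25 URIs per call.
BATCH_SIZE = 25


def fetch_posts(uris: List[str]) -> List[dict]:
    """Look up one batch of post URIs; the API returns only posts that still exist."""
    query = urllib.parse.urlencode([("uris", uri) for uri in uris])
    url = f"{PUBLIC_APPVIEW}/xrpc/app.bsky.feed.getPosts?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp).get("posts", [])


def rehydrate(
    uris: Iterable[str],
    fetch: Callable[[List[str]], List[dict]] = fetch_posts,
) -> Dict[str, dict]:
    """Map each still-available URI to its current post record, batch by batch."""
    uri_list = list(uris)
    found: Dict[str, dict] = {}
    for start in range(0, len(uri_list), BATCH_SIZE):
        batch = uri_list[start:start + BATCH_SIZE]
        for post in fetch(batch):
            found[post["uri"]] = post
    return found
```

Any URI that no longer resolves is simply absent from the returned dictionary, so the researcher's copy only ever contains what is still live at the time of rehydration.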

ryanjgallag.com•91 days ago

This has several benefits

If a user deletes their post or account, then their data is not persisted outside of the system

It also adds friction. Using the data requires more time and effort (spent rehydrating it) than simply downloading a large file from HuggingFace

ryanjgallag.com•91 days ago

3) How was the data going to be used?

A lot of the blowback is about the data being used to train generative AI / LLMs (especially proprietary ones)

ryanjgallag.com•91 days ago

People's consent changes based on how their data may be used

Is it for gen AI? Is it for spam filtering? Is it for public health monitoring? Is it for election forecasting? Is it for abuse detection?

adityaponnada.bsky.social•91 days ago

Also, given that the data was collected during the platform's early adoption, will it change the way people use it?
