In the HuggingFace/Bluesky incident, the problem goes deeper than whether the data is "public" or "private"
What matters to people is whether their data was collected, which data was collected, how it may be used, and who it may be used by
Comments
In theory, anyone's. Bluesky is designed to make everyone's data accessible. You don't even need to authenticate to grab the data from an API!
It's a major shortcoming that the AT Protocol does not currently accommodate any notion of "private" data
Post and account data was hosted *directly* on HuggingFace
Best practice for sharing social media data publicly is to post the IDs of posts and accounts, rather than the posts and accounts themselves
Then, researchers can use these IDs to "rehydrate" the data themselves
That way, if a user deletes their post or account, their data does not persist outside of the system
It also adds friction. Using the data requires more time and effort (spent rehydrating it) than simply downloading a large file from HuggingFace
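To make the rehydration idea concrete, here is a minimal sketch in Python. Everything named in it is hypothetical: `rehydrate`, `fetch_post`, the `post-N` IDs, and the in-memory `live_posts` dictionary that stands in for the live platform. A real version would call the platform's own lookup API instead, and the details of that call are not shown here.

```python
from typing import Callable, Dict, Iterable, List, Optional

Post = Dict[str, str]


def rehydrate(
    post_ids: Iterable[str],
    fetch_post: Callable[[str], Optional[Post]],
) -> List[Post]:
    """Look up each shared ID and keep only the posts that still exist.

    `fetch_post` is a hypothetical lookup function: it returns the post
    for an ID, or None if the post (or its account) has been deleted.
    """
    posts = []
    for post_id in post_ids:
        post = fetch_post(post_id)
        if post is not None:
            posts.append(post)
    return posts


# Stand-in for the live platform (an assumption for illustration only).
# "post-2" was deleted by its author after the ID list was published.
live_posts = {
    "post-1": {"id": "post-1", "text": "hello"},
    "post-3": {"id": "post-3", "text": "still here"},
}

# The shared dataset contains only IDs, never post content.
shared_ids = ["post-1", "post-2", "post-3"]

rehydrated = rehydrate(shared_ids, live_posts.get)
print([post["id"] for post in rehydrated])
```

Because the published file holds only IDs, the deleted post never reaches the researcher: the lookup for `post-2` returns nothing, so it is simply absent from the rehydrated set.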
A lot of the blowback is about the data being used to train generative AI / LLMs (especially proprietary ones)
Is it for gen AI? Is it for spam filtering? Is it for public health monitoring? Is it for election forecasting? Is it for abuse detection?