Just to make it formal: I have not consented to any use of my site data for any machine learning project. Nor would I have ever, under any circumstances.
Hugging Face is involved in some super fucking horrible stuff. They try to keep it on the dl but it's there if you look. If your brand works with goddamn Boston Dynamics you are, in fact, the bad guy.
one hell of a "mistake" dude. most people don't accidentally steal data from hundreds of thousands of people like this. I want to believe that you learned your lesson, but the only way that'll be true is if I never hear about you again
Getting back to this for a moment, Daniel uses the phrase "deleted for now" twice more in the repo. This is for anyone reading this later. Why he uses that specific phrase not just here, but also more than once in the repo comment, is beyond me if it's as simple as "we want to keep the dialogue open"
I feel like you are attributing malice where there is none. It seems pretty clear to me that the implication is that when we have figured out how to let users consent, a new dataset can be created. The Bluesky team are also interested in figuring this out.
I'm not attributing malice. I'm incredulous. The language was there an hour ago. As we spoke in another thread, Daniel responded to multiple people while the set was up, but only people who spoke positively of the scraped data. Around the time the set got pulled, bsky issued a statement about scraping
Are you aware of Bluesky datasets posted on Hugging Face after Daniel's dataset was deleted? These datasets were created without respect for user consent. Bluesky users who ask for their data to be removed are being trolled. Torrents are being made and posted on HF. https://huggingface.co/datasets/alpindale/two-million-bluesky-posts/discussions
My wishes, many people's wishes, arguably most people's wishes, are that the LLMs being fed this data would simultaneously implode, permanently and irrevocably.
I love your implication that progress is simply inevitable, as if you yourself weren't one of the hands turning the crank that you could simply stop doing at literally any time.
The default standard, when no standard is defined, is don't touch other people's data at all. If you don't ask first you are removing consent no matter how you spin it.
You don't actually have to build the Torment Nexus, and you certainly don't need to try and justify building the Torment Nexus just because someone else might do it later.
Let me see if I understand. Your excuse is that you assumed someone was going to consentlessly scrape everyone's shit for personal gain and the detriment of users and the planet, and then thought "well golly, we better get in there first!"
That's my concern. And props to pngwn here for trying to be in front of it and being a much better communicator on this than Daniel has been. But I'm not convinced that this won't happen again, and I wish the entire AI bubble would hurry up and burst.
I want to be able to consent to having my posting and commenting available for anyone to use. Please give me a way to do that -- or if there is a way, please publicize it better!
I respect people who don't want their writing and thoughts used for training. On the other hand this thread is a good source of people who rudely express strong negative judgments with no explanation. I block those people. Keep BlueSky friendly (or at least civil)!
I honestly don't know if it is because that's part of the protocol, to have everything ever posted be public and publicly available. Once there is public data, I don't know how you obtain consent so I don't think it's been clearly challenged in EU courts. The act of posting is giving consent in a way
That's not exactly the same. There are extremely clear rulings around that (though is legal precedent really something we can rely on anymore...😓), there _aren't_ around this. I honestly think it was probably illegal but it's gray enough that it would probably require a ruling.
It is deletable in the app BTW, so anyone using the app (the thing actually provided by the company) cannot access it through said app (or website)
As in, if the fundamental architecture means deletability is impossible, they have a much much much bigger problem, because GDPR says “no, you move. Or be fined into the Sun.”
Thank you for actually listening to me and removing the dataset. Although this is just a small step, 1 million posts is a huge number. Again, thanks for taking it offline.
If you decide to reverse that "for now" this is me formally notifying you that I do not consent to you storing my content or using it for ML training purposes.
Why make it accessible to everyone when just Mark and Elon should have it? Sorry you got all this backlash for doing a good thing. It’s not your fault people don’t realize what year it is.
That you have done the bare minimum in terms of honouring my intellectual property is fine. However, from a legal standpoint, I would like clarification as to what you mean by "for now". My IP will remain mine until many years after my death. This is not likely to change.
you've started a social trend of bad actors using the api to deliberately create antagonistic bsky datasets on huggingface (ie: "two-million-bluesky-posts" repo)
completely inadequate response to the fire you started. as usual, you do massive damage and then retreat bashfully and innocently
Great first move. Consider when you do things like this that this is *exactly* why folks get angry about how machine learning folks are approaching our data. Are our conversations in public? Sure, but recording them and then using them for your own purposes is still incredibly tacky w/o permission.
Transparency be damned, this is a massive breach of rules of consent. Reading through the Hugging Face feed left me with serious feelings of grave concern.
This could prompt me to exit Bluesky. Thinking about it.
In order to harvest data from Bluesky, you would need the explicit consent of every user whose data you collected. Not doing so is a violation of copyright laws in most countries.
Because...that would be apt.
The word you might be thinking of is "facehuggers"
Reported
https://bsky.app/profile/chickenpuppet.bsky.social/post/3lbvbzl4abc25
https://bsky.app/profile/did:plc:7e5mpxuweopubhexwqg5l3ba/post/3lbvih4luvk23
Someone will create this dataset, we’d rather it was done in a way that is respectful of users' wishes. That seems to be what everyone wants too.
What we can do though is work towards a solution that ensures your data, and the data of anyone who agrees with you, does not get used to train AI.
Closed companies will do what they do, we can’t affect that, but we can try to set some standards.
https://docs.bsky.app/docs/api/com-atproto-repo-list-records
Here's where you can find your posts:
https://bsky.social/xrpc/com.atproto.repo.listRecords?repo=jedharris.bsky.social&collection=app.bsky.feed.post
There are *pointers* to things you've replied to, but only your own posts are actually returned. (I really like the HTTP API.)
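For anyone who wants to check this themselves, here is a rough Python sketch of that same listRecords call. The host, endpoint, and collection come from the URL above; the helper names (`list_records_url`, `summarize`, `fetch_page`) are made up for illustration, and the response layout (a `records` list whose `value` holds `text` and an optional `reply.parent.uri`) is my reading of the docs linked above, so treat it as an assumption rather than a guarantee.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host taken from the example URL in the thread; other servers may differ.
PDS_HOST = "https://bsky.social"


def list_records_url(repo, collection="app.bsky.feed.post", cursor=None):
    """Build the public listRecords URL for one account's posts."""
    params = {"repo": repo, "collection": collection}
    if cursor:
        # Cursor from a previous page, used to request the next one.
        params["cursor"] = cursor
    return f"{PDS_HOST}/xrpc/com.atproto.repo.listRecords?{urlencode(params)}"


def summarize(page):
    """Pull out each post's URI, text, and the URI it replies to (if any).

    Only a pointer to the parent post is present; the parent's text is not.
    """
    rows = []
    for record in page.get("records", []):
        value = record.get("value", {})
        parent = value.get("reply", {}).get("parent", {})
        rows.append({
            "uri": record.get("uri"),
            "text": value.get("text", ""),
            "reply_to": parent.get("uri"),
        })
    return rows


def fetch_page(repo, cursor=None):
    """Fetch one page of records over plain HTTP (makes a network request)."""
    with urlopen(list_records_url(repo, cursor=cursor)) as response:
        return json.load(response)
```

Calling `summarize(fetch_page("jedharris.bsky.social"))` should list that account's own posts, with `reply_to` set only on replies. No login is involved in this sketch, which is the point being made in the thread.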
Please ensure it is not in any future data sets.
And that is not a bad thing in my book.
(I do get to whinge about bad ideas, though.)