The Wayback Machine won't be that helpful for this... it's essentially a web crawler and doesn't collect datasets such as the ones being discussed here.
Take a look at this system... you DON'T want a single point of failure such as the WM, but rather a networked resource like this and the LOCKSS networks:
Interesting, but LOCKSS reads as pretty centralized, though? Even if the data is decentralized, the register seems to be controlled centrally? And isn't https://Archive.org kinda decentralized, with their backups in CA + EU + the magnet links?
Let’s put it this way: LOCKSS isn’t that centralized and secondly, in 30 years, I’ve yet to hear of any of their collections being hacked… unlike the WM and https://Archives.org. Also, backups are NOT the same thing as an archive.
Two issues. (1) They are pulling down web pages, which Wayback can help with. (2) They are pulling off data sets, which you are correct Wayback can't help with.
Also: Internet Archive created an almost exact copy of all federal websites for the change in administrations. Materials taken down will likely be archived there.
It sounds like some of the datasets are going to show up there, but the main archive there is of websites, and the databases they link to don’t always get well preserved.
Are you trying to get people jailed or worse? Why on earth would anyone commit a name or IP address to any list that the predatory coup leaders could get hold of?
I would email them and the Wayback Machine people. I think they are saving datasets as well. I mean, redundancy is good, but also I don't want people to get confused about where to go.
A hash is a small string of characters, maybe 32 chars, that is calculated from much larger data sets or documents, so we can have hundreds or thousands of people save the hashes independently, or even have trusted journalists save hashes, to cross-verify that data saved by others hasn't been tampered with.
We can even potentially create a register, IPFS, or torrent file to which journalists upload all their hashes. Journalists and other participants would all need a public key, though, so they can cryptographically sign all the hashes that they independently verified. This way we can build a very robust dataset that is much smaller than the actual data that needs to be preserved, which can help us verify the credibility of the data and ensure it hasn't been tampered with. It wouldn't be based on the raw number of signatures, but you'd be able to tell who to actually trust.
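To make the idea above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: SHA-256 is my pick (its hex digest is 64 characters, a bit longer than the 32 mentioned above), the signer names and the `trusted_agreement` helper are hypothetical, and the public-key signature check is assumed to have already happened elsewhere, since the standard library doesn't do that part.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def trusted_agreement(digest, attestations, trusted):
    """List the trusted signers who attested to exactly this digest.

    attestations: (signer_name, digest) pairs. ASSUMPTION: each pair's
    signature was already verified against the signer's public key.
    trusted: the set of signer names that YOU have chosen to trust.
    """
    return sorted(s for s, d in attestations if d == digest and s in trusted)


# Placeholder bytes standing in for a real dataset file.
dataset = b"example dataset contents"
digest = sha256_hex(dataset)

# Hypothetical attestations collected from a shared register.
attestations = [
    ("journalist_a", digest),
    ("journalist_b", digest),
    ("unknown_x", sha256_hex(b"tampered contents")),
]
trusted = {"journalist_a", "journalist_b"}

confirmed = trusted_agreement(digest, attestations, trusted)
print(len(digest), confirmed)
```

The point of returning names instead of a count is the one made above: what matters is which signers agree, not how many signatures there are.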
And feel free to mention my experience with my doctor yesterday, who discussed doing a TB test to rule a thing out...to which I responded by asking her about the TB outbreak in Kansas.
She'd heard NOTHING.
This is a good doc who pays attention. News is getting suppressed that doctors NEED.
Doc said it was just a blood test, but on the website, it says I have to go to one of the hospitals for it, can't just go to one of the many clinics for this health care "chain" that does blood work. Maybe I'm getting a skin test?
Thanks! My neighborhood definitely used govt data to fight synthetic turf being brought into local parks, so I let people know. Also lots of parents who might want to check the vaccine schedules, etc. ugh.
There are lots. And big organized groups at that. But a lot of journalists (esp freelancers) haven’t and might need specific stuff that the big groups don’t have. So that’s where today comes in
I have a colleague who runs a data storage + analysis platform, and he's interested in making some of these disappeared datasets publicly available on his platform. Who should he be connecting with?
And it should be available on the web archive. r/datahoarder people are saving data, and this project has also existed for a few months: https://bsky.app/profile/eotarchive.org
I guess most of the public data should not get lost.
Comments
https://lil.law.harvard.edu/blog/2025/01/30/preserving-public-u-s-federal-data/
It is a single point of failure… which is why I keep pushing the #LOCKSS approach…
Thanks to the heroes documenting gov websites before they are taken down!
http://web.archive.org/web/20250000000000*/https://www.cdc.gov/yrbs/index.html
And I suspect their lawyers would not consider it a good policy.
Can’t guarantee everything is there but if it was downloadable on Jan 19, it should be archived.
https://old.reddit.com/r/DataHoarder/comments/1idj6dm/all_us_federal_government_websites_are_already/
We can also have people save *hashes* of the data set, so even if someone doesn't have the space, they can help maintain the integrity of the data.
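For anyone wondering what that looks like in practice, here is a rough Python sketch of checking a saved copy against a list of saved hashes. The manifest layout (file name mapped to SHA-256 hex digest), the file names, and the `file_sha256` / `check_copy` helpers are all assumptions made up for illustration, not an existing tool or format.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_copy(path, manifest):
    """Compare a local copy against a saved hash (hypothetical manifest)."""
    expected = manifest.get(Path(path).name)
    if expected is None:
        return "unknown"  # nobody saved a hash for this file name
    return "ok" if file_sha256(path) == expected else "mismatch"


# Demo with a throwaway file; the contents are placeholders, not real data.
with tempfile.TemporaryDirectory() as tmp:
    copy = Path(tmp) / "dataset.csv"
    copy.write_bytes(b"original contents")

    # Someone without storage space keeps only this small mapping.
    manifest = {"dataset.csv": file_sha256(copy)}
    status_before = check_copy(copy, manifest)

    # Simulate the copy being altered later.
    copy.write_bytes(b"altered contents")
    status_after = check_copy(copy, manifest)

    status_missing = check_copy(Path(tmp) / "other.csv", manifest)

print(status_before, status_after, status_missing)
```

The manifest is a few dozen bytes per file no matter how big the dataset is, which is why people without disk space can still help.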
#ArchCDC
#ArchGOV
https://bsky.app/profile/rebound.liberalrepublic.org/post/3lgerjptddc2q
Do not add your name to a list, do not centralize the information. When the coast is clear people will volunteer it up.
https://www.lockss.org/
Also it should be noted that last year, the CDC did a major site redesign and irresponsibly erased a lot of content and broke many links.
There's also a contact email here to suggest additional datasets that didn't get captured and could be part of their collection https://lil.law.harvard.edu/blog/2025/01/30/preserving-public-u-s-federal-data/?ref=404media.co
Might be useful to combine efforts!
Would be much better for people to be able to contribute through more secure channels if possible
Godspeed, and thank you.
#DATALOVE
#Telcomix
#MeMBu
https://www.tiktok.com/t/ZTYMxB22Y/
https://blog.archive.org/2024/05/08/end-of-term-web-archive/