Check out ArchiveTeam's work on this. You can help by running their "appliance" in a virtual machine; the data it collects is given to the Internet Archive.
"This collection includes web crawls of the Federal Executive, Legislative, and Judicial branches of government performed at the end of US presidential terms of office."
Seeding is hosting a file so that it can be downloaded via BitTorrent. Typically you download a file over torrent by getting individual pieces of it from machines all over the internet; the machines supplying those pieces are "seeding" the file.
BitTorrent is a file transfer protocol. It's used through torrent clients (programs you download and run), which grab the file from multiple other people who are sharing (or seeding) it instead of just from one central server.
The advantage is it can't as easily be taken down since it's distributed.
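To make the "pieces from many machines" idea concrete, here is a toy sketch in Python. It is not a real torrent client and does no networking: the function names, the 64-byte piece size, and the sample data are all made up for illustration. The one real detail is that classic (v1) torrents list a SHA-1 hash for every piece, which is how a client can accept pieces from strangers in any order and still discard corrupted ones.

```python
import hashlib


def split_into_pieces(data: bytes, piece_length: int) -> list[bytes]:
    """Cut the file contents into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(pieces: list[bytes]) -> list[bytes]:
    """SHA-1 digest of each piece, like the per-piece hashes in a v1 .torrent file."""
    return [hashlib.sha1(p).digest() for p in pieces]


def assemble(expected: list[bytes], offers: list[tuple[int, bytes]]) -> bytes:
    """Accept pieces in any order from any peer, keeping only those whose hash matches."""
    have: dict[int, bytes] = {}
    for index, payload in offers:
        if index in have:
            continue  # already hold a verified copy of this piece
        if hashlib.sha1(payload).digest() == expected[index]:
            have[index] = payload
    if len(have) != len(expected):
        raise ValueError("download incomplete")
    return b"".join(have[i] for i in range(len(expected)))


# Hypothetical sample data standing in for a real file.
original = b"example government dataset " * 10
pieces = split_into_pieces(original, 64)
hashes = piece_hashes(pieces)

# Simulated offers from several "peers": out of order, one corrupted, some repeated.
offers = [(2, pieces[2]), (0, b"corrupted"), (0, pieces[0])]
offers += [(i, pieces[i]) for i in range(len(pieces))]

restored = assemble(hashes, offers)
print(restored == original)
```

The corrupted piece is dropped because its hash does not match, and the same piece is later accepted from another "peer". That is roughly why it does not matter which seeder any given piece comes from.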
it sounds more complicated than it is. it’s just a program. it lets you download stuff from a bunch of other users (like cdc files). then you can join in and “seed” the files too. more people = more copies & faster downloads.
I saw you were downloading from the file on that page, like I was in my screenshot. There is a "review" on that page that points to a much better file. To get there from that page, click "Show All" on the right, then search that page for .torrent; the second one will be the one in that "review".👍
I guess I'll add: I clicked "The dataset collection" on the right, sorted the list by "Date Added" and found this one also, which says it excludes datasets. Datasets are in that first one? I'm using the torrent file in the Show All. /shrug
It seems that my dl of the autogen torrent is updated on my end? File size went from 98GB to 104GB from when I looked last, and when I tried to add the USETHIS version it just added a new tracker and wouldn't dl the whole "new" dataset. Does that seem right? Or should I delete my local copy + re-DL?
Make sure your device has enough free storage for the download. Download a torrent client and set up its storage locations.
Add the torrent available on the web page.
The files will download, and once they're done, leave the program running in the background to upload/seed for other people.
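For the storage step, a quick way to check free space before adding a large torrent is Python's standard `shutil.disk_usage`. This is only a sketch: the `has_room` helper, the 10% headroom, and the 104 GB figure (borrowed from the size mentioned earlier in the thread) are assumptions, so substitute the actual torrent size and your own download folder.

```python
import shutil


def has_room(path: str, download_bytes: int, headroom: float = 0.10) -> bool:
    """Return True if the drive holding `path` can fit the download plus some headroom.

    `headroom` is an arbitrary safety margin (10% by default), not a torrent requirement.
    """
    free = shutil.disk_usage(path).free
    return free >= download_bytes * (1 + headroom)


# Hypothetical example: a roughly 104 GB download saved to the current directory.
torrent_size = 104 * 10**9
if has_room(".", torrent_size):
    print("Enough free space, go ahead and add the torrent.")
else:
    print("Not enough free space, pick another drive or clear some room.")
```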
Wait for the "this item is currently being modified/updated by the task: derive" notice to pass, then hit the .torrent link in the Download Options field on the right.
They’ve always made it easy to get mirrors (current and full history) so no worries there. It’s not too difficult to spin up a local version of Wikipedia on your computer
I started doing so by his 4th day in office. I then went at a more frenetic pace to archive more on Jan 31. Fortunately I captured an HHS webpage with HIPAA guidance on complying with the 2024 Final Rule protecting reproductive healthcare in medical records. The page was gone after 5 pm on Jan 31.
Is that on I saw a lot there. I don’t know what’s behind professional paywall or what that costs. Looks like a fair amount is public.
As a healthcare & data privacy lawyer, I'm archiving *.gov webpages related to HHS laws and regulations. Thanks for sharing, which requires a subscription to access. From the description, it appears focused on clinical information as opposed to the legal information that I need.