Folks who use #rstats with github, how am I supposed to be managing the data for my project with 100mb file size limit? Am I going about this all wrong?
Comments
You can use git-lfs (Large File Storage). But yes, the real question is why you want to put that data on GitHub; we usually care about tracking changes in code. Data is static. Maybe a cheap S3 bucket would be more suitable? Or Dropbox?
I wasn't meaning to say that putting it on GitHub via git-lfs makes no sense; I've been thinking about it myself at times. FWIW, sharing raw data on Dropbox has worked well for me so far.
I'm glad you asked 🤗. There are two components; I post both as pictures. First, each collaborator is instructed to set an environment variable pointing to the local Dropbox root (via an instructive error). Once you have that, the `paths` function will give you whatever you need inside the data.
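A minimal sketch of that kind of setup, since the original was shared as pictures (the variable name `PROJECT_DROPBOX_ROOT` and the body of `paths()` are my assumptions):

```r
# Hypothetical reconstruction: fail with an instructive error if the
# environment variable is not set, then build paths under the Dropbox root.
dropbox_root <- function() {
  root <- Sys.getenv("PROJECT_DROPBOX_ROOT", unset = "")
  if (root == "") {
    stop(
      "Set PROJECT_DROPBOX_ROOT in your .Renviron, e.g.\n",
      '  PROJECT_DROPBOX_ROOT="~/Dropbox/my-project"',
      call. = FALSE
    )
  }
  root
}

paths <- function(...) {
  # Full path to a file inside the shared data folder
  file.path(dropbox_root(), "data", ...)
}

# Usage: dat <- readRDS(paths("survey", "wave1.rds"))
```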
There's a reason you don't want big files on GitHub, and that's because git records *every* change to every file. So if you change one data point in a big dataset, it keeps the old dataset as well, and so on. Your repositories can get massive/unworkable very easily.
I’m fine adding it to .gitignore, but is that really a feasible solution here? I don’t need version control for the data itself, just a convenient way for all authors to load it.
If you really want it to be on demand, then you can put it in a SQL server and host it via AWS or something. Generally for academic projects I just put a saved data file in Google Drive/Dropbox.
Unless of course, as people mention, you keep your big data somewhere else and add it to the repo as needed. GitHub's features are only really useful for code/articles anyway.
We host the original data in a separate cloud and load it only for the first processing steps. But if you need the full data for estimations....I have no idea!
Yeah I support all the below. Git (and thereby GitHub) is for code, not data (nor output nor graphs).
Ideally, if your data is public, the repo ought to include a script to download and process it (and eventually store it locally for convenience) - that ensures replicators can follow every step (see the sketch after this comment)
If the data is sensitive/not public - well it should not be public on GitHub ;)
If the data is somehow manually collected, yep, sharing it in other ways (e.g. through your own website if not massive) could be a way, or Dropbox among coauthors.
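As a rough example of that download-and-process script idea (the URL and file names are made up):

```r
# data-raw/get_data.R -- hypothetical URL/file names, just to show the pattern
raw_dir  <- "data-raw"            # gitignored
raw_file <- file.path(raw_dir, "big_dataset.csv")

if (!dir.exists(raw_dir)) dir.create(raw_dir)
if (!file.exists(raw_file)) {
  download.file("https://example.org/big_dataset.csv", raw_file, mode = "wb")
}

dat <- read.csv(raw_file)
# ... cleaning / processing steps ...
saveRDS(dat, file.path(raw_dir, "big_dataset_clean.rds"))  # cached locally
```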
pins https://pins.rstudio.com @posit.co allows smart sharing (w/ caching, versioning, etc) of large data across cloud services (e.g. Dropbox, Drive, etc)
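For instance, a minimal {pins} sketch, assuming the board lives in a shared Dropbox folder (paths and pin names are placeholders):

```r
library(pins)

# A folder board inside the shared Dropbox ({pins} also has board_s3(), board_gdrive(), ...)
board <- board_folder("~/Dropbox/my-project/pins", versioned = TRUE)

# One author writes the data once (assuming `big_dataset` is already in the session)...
pin_write(board, big_dataset, name = "big_dataset", type = "rds")

# ...and coauthors read (and cache) it
big_dataset <- pin_read(board, "big_dataset")
```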
I connect the project's GitHub repository to an OSF project and store the data at OSF since they have big file size limits. I then use {osfr} to download the data from OSF to a gitignored folder in my GitHub repository.
I add the GitHub repo as a storage provider, and I *think* OSF pulls the different changes into its system (or maybe it just links to GH? idk), and I also use the OSF storage system, so it looks like this
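A minimal sketch of that {osfr} download step, with a made-up OSF project id and a gitignored data/ folder:

```r
library(osfr)

# Retrieve the OSF project (the id "abcde" is a placeholder)
osf_project <- osf_retrieve_node("abcde")

# Download everything into a gitignored folder, skipping files already present
osf_ls_files(osf_project) |>
  osf_download(path = "data", conflicts = "skip")

dat <- readRDS("data/big_dataset.rds")  # hypothetical file name
```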
piggyback https://docs.ropensci.org/piggyback/ @cboettig.bsky.social also allows sharing large datasets through GitHub releases
Has version control.
{osfr} - can directly load data.
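And a minimal {piggyback} sketch for the GitHub-releases route (repo name and tag are placeholders):

```r
library(piggyback)

# Upload a large file to a GitHub release
pb_upload("data/big_dataset.rds", repo = "user/my-project", tag = "v1.0.0")

# Coauthors pull it into a gitignored folder
pb_download("big_dataset.rds", dest = "data",
            repo = "user/my-project", tag = "v1.0.0")
```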
OSF: https://osf.io/aqvnk/
GH repo: https://github.com/andrewheiss/mountainous-mackerel
GH repo with DOI: https://doi.org/10.5281/zenodo.12817616
So many DOIs 🙃