Folks who use #rstats with github, how am I supposed to be managing the data for my project with 100mb file size limit? Am I going about this all wrong?
Comments
You can use git-lfs (Large File Storage). But yes, the real question is why you want to put that data on GitHub; we usually care about tracking changes in code. Data is static. Maybe a cheap S3 bucket would be more suitable? Or Dropbox?
I wasn't meaning to say that putting it on GitHub via git-lfs makes no sense; I've been thinking about it myself at times. FWIW, sharing raw data on Dropbox has worked well for me so far.
I'm glad you asked 🤗. There are two components; I post both as pictures. First, each collaborator is instructed to set an environment variable pointing to the local Dropbox root (via an instructive error). Once you have that, the `paths` function will give you whatever you need inside the data.
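A minimal sketch of that kind of setup, since the original was shared as pictures (the variable name `PROJECT_DROPBOX_ROOT` and the body of `paths()` are my assumptions):

```r
# Hypothetical reconstruction: fail with an instructive error if the
# environment variable is not set, then build paths under the Dropbox root.
dropbox_root <- function() {
  root <- Sys.getenv("PROJECT_DROPBOX_ROOT", unset = "")
  if (root == "") {
    stop(
      "Set PROJECT_DROPBOX_ROOT in your .Renviron, e.g.\n",
      '  PROJECT_DROPBOX_ROOT="~/Dropbox/my-project"',
      call. = FALSE
    )
  }
  root
}

paths <- function(...) {
  # Full path to a file inside the shared data folder
  file.path(dropbox_root(), "data", ...)
}

# Usage: dat <- readRDS(paths("survey", "wave1.rds"))
```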
There's a reason you don't want big files on GitHub, and that's because git records *every* change to every file. So if you change one data point in a big dataset, it keeps the old dataset as well, and so on. Your repositories can get massive/unworkable very easily.
I’m fine adding it to .gitignore, but is that really a feasible solution here? I don’t need version control for the data itself, just a convenient way for all authors to load it.
If you really want it to be on demand, then you can put it in a SQL server and host it via AWS or something. Generally for academic projects I just put a saved data file in Google Drive/Dropbox.
Unless of course, as people mention, you keep your big data somewhere else and add it to the repo as needed. GitHub's features are only really useful for code/articles anyway.
We host the original data in a separate cloud and load it only for the first processing steps. But if you need the full data for estimations....I have no idea!
Yeah I support all the below. Git (and thereby GitHub) is for code, not data (nor output nor graphs).
Ideally, if your data is public, the repo ought to include a script to download and process it (and eventually store it locally for convenience) - that ensures replicators can follow every step (see the sketch after this comment)
If the data is sensitive/not public - well it should not be public on GitHub ;)
If the data is somehow manually collected, yep, sharing it in other ways (e.g. through your own website if not massive) could be a way, or Dropbox among coauthors.
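As a rough example of that download-and-process script idea (the URL and file names are made up):

```r
# data-raw/get_data.R -- hypothetical URL/file names, just to show the pattern
raw_dir  <- "data-raw"            # gitignored
raw_file <- file.path(raw_dir, "big_dataset.csv")

if (!dir.exists(raw_dir)) dir.create(raw_dir)
if (!file.exists(raw_file)) {
  download.file("https://example.org/big_dataset.csv", raw_file, mode = "wb")
}

dat <- read.csv(raw_file)
# ... cleaning / processing steps ...
saveRDS(dat, file.path(raw_dir, "big_dataset_clean.rds"))  # cached locally
```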
pins https://pins.rstudio.com @posit.co allows smart sharing (w/ caching, versioning, etc) of large data across cloud services (e.g. Dropbox, Drive, etc)
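For instance, a minimal {pins} sketch, assuming the board lives in a shared Dropbox folder (paths and pin names are placeholders):

```r
library(pins)

# A folder board inside the shared Dropbox ({pins} also has board_s3(), board_gdrive(), ...)
board <- board_folder("~/Dropbox/my-project/pins", versioned = TRUE)

# One author writes the data once (assuming `big_dataset` is already in the session)...
pin_write(board, big_dataset, name = "big_dataset", type = "rds")

# ...and coauthors read (and cache) it
big_dataset <- pin_read(board, "big_dataset")
```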
I connect the project's GitHub repository to an OSF project and store the data at OSF since they have big file size limits. I then use {osfr} to download the data from OSF to a gitignored folder in my GitHub repository.
I add the GitHub repo as a storage provider, and I *think* OSF pulls the different changes into its system (or maybe it just links to GH? idk), and I also use the OSF storage system, so it looks like this
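A minimal sketch of that {osfr} download step, with a made-up OSF project id and a gitignored data/ folder:

```r
library(osfr)

# Retrieve the OSF project (the id "abcde" is a placeholder)
osf_project <- osf_retrieve_node("abcde")

# Download everything into a gitignored folder, skipping files already present
osf_ls_files(osf_project) |>
  osf_download(path = "data", conflicts = "skip")

dat <- readRDS("data/big_dataset.rds")  # hypothetical file name
```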
piggyback https://docs.ropensci.org/piggyback/ @cboettig.bsky.social also allows sharing large datasets through GitHub releases
Has version control.
{osfr} - can directly load data.
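And a minimal {piggyback} sketch for the GitHub-releases route (repo name and tag are placeholders):

```r
library(piggyback)

# Upload a large file to a GitHub release
pb_upload("data/big_dataset.rds", repo = "user/my-project", tag = "v1.0.0")

# Coauthors pull it into a gitignored folder
pb_download("big_dataset.rds", dest = "data",
            repo = "user/my-project", tag = "v1.0.0")
```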
OSF: https://osf.io/aqvnk/
GH repo: https://github.com/andrewheiss/mountainous-mackerel
GH repo with DOI: https://doi.org/10.5281/zenodo.12817616
So many DOIs 🙃