GitHub was down, hard, for an hour.
These kinds of outages are increasingly unacceptable: a company’s code repo is now treated as infra. Repo down → no deploys possible (not even to resolve outages!)
GitHub doesn’t offer infra-level reliability, though, ignoring this reality.
Comments
This isn't GitHub's fault, it's yours
Meanwhile GH/MS is pushing AI, maybe push reliability first …
if only...
Like yes, this broke a ton of stuff, BUT will people actually move to another provider? There are lots of other providers out there - GitLab and Bitbucket, to name a few.
My guess is this 1h outage is painful but few will actually switch.
Or do emails, do the expenses you've been procrastinating on for weeks, or, heaven forbid, talk to your colleagues.
And then things like this happen, and they fall down from the ivory tower, and we see that they behave like any other small start-up
GitLab exists.
Yeah it’s def connected. GH offers great value… but there’s a risk, as we’re seeing
(And very likely well within the SLAs between them and their customers. Yes, this is both snark and also not.)
Looks like it checks in 1-2 times a day; days with more than 2 commits are where GH has reported an incident. Some are minor, some are not.
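For illustration, a minimal sketch of that kind of check, assuming a local clone of the tracking repo (the clone path and the "more than 2 commits" threshold come from the comment above, not from any official tool):

```python
# Count commits per day in a local clone of the tracking repo; days with
# more than the usual 1-2 check-ins likely coincide with a reported incident.
import subprocess
from collections import Counter

def commits_per_day(repo_path: str) -> Counter:
    # One short date (YYYY-MM-DD) per commit, then tally them per day.
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--date=short", "--pretty=format:%ad"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line for line in out.splitlines() if line)

if __name__ == "__main__":
    counts = commits_per_day("github-status-tracker")  # hypothetical clone path
    for day, n in sorted(counts.items()):
        if n > 2:  # baseline is 1-2 check-ins per day
            print(f"{day}: {n} commits -- possible incident day")
```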
Yes it’s bad for critical infra… but perhaps most dev teams are incorrectly assuming they get high reliability on a cheap service with no strict SLAs? And need to pay up for high reliability?
https://bsky.app/profile/gergely.pragmaticengineer.com/post/3lfotgsq3tc2u
Less exposure than GitHub, but the same issue for any company relying on git-based deployment 😇. (Yes, I need to merge my PR in order to ship my code, and I can't!)
Teams need disaster response scenarios, so if GitHub is still down, they can still deploy, even if it's manual.
Mirror git locally. Or to GitLab. I worked in a place that didn't care if us-east-1 died, so did their customers. :-(
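A rough sketch of that mirroring idea (and of the disaster-response fallback from the comment above it), assuming a hypothetical GitLab remote URL and plain git driven from Python; not any official tooling:

```python
# Push a mirror of the local repo to a secondary remote so deploys can
# fall back to it if GitHub is unreachable.
import subprocess

MIRROR_URL = "git@gitlab.com:example/myrepo-mirror.git"  # assumption, not a real repo

def git(*args: str, repo: str = ".") -> None:
    subprocess.run(["git", "-C", repo, *args], check=True)

def push_mirror(repo: str = ".") -> None:
    # Add the secondary remote once (ignore the error if it already exists),
    # then push all refs so the mirror stays in sync.
    try:
        git("remote", "add", "mirror", MIRROR_URL, repo=repo)
    except subprocess.CalledProcessError:
        pass  # remote already configured
    git("push", "--mirror", "mirror", repo=repo)

if __name__ == "__main__":
    push_mirror()  # e.g. run from CI or a cron job after every merge
```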
Call me GenX but if it’s really that critical to your operation, you should figure out how to cover your own ass.
There’s very much a trade-off of availability vs. feature velocity and cost.
A load balancer config change meant GH bounced all requests.
For an hour.
Yes, incidents happen, but this is simply not how a mission-critical service should operate.
https://www.githubstatus.com/incidents/qd96yfgvmcf9
“not how a mission-critical service should operate” — you mean they shouldn’t make changes to their configs? I genuinely have no idea what this means.
You'd hope that'd be a 5-10 minute sort of thing instead.
In the note, GH seemed to agree that they want to make changes so detection + mitigation will not take this long.
My point was that an hour to catch a change that broke everything seems too long
I’ve been involved in incidents like that, and either a significant time delta or other subsequent changes since that change can make it hard to know which change to roll back.
Perhaps you cannot expect high uptime by default: you need to invest more to get it
https://bsky.app/profile/bigvalen.bsky.social/post/3lforx35pis2k