GitHub was down, hard, for an hour.
These kinds of outages are increasingly unacceptable: a company’s code repo is now treated as infra. Repo down → no deploys possible (not even to resolve outages!)
GitHub doesn’t offer infra-level reliability, though, ignoring this reality.
Comments
This isn't GitHub's fault, it's yours
Meanwhile GH/MS is pushing AI, maybe push reliability first …
if only...
Like yes, this broke a ton of stuff, BUT will people actually move to another provider? There are lots of other providers out there - GitLab and Bitbucket, to name a few.
My guess is this 1h outage is painful but few will actually switch.
Or do emails, do the expenses you've been procrastinating on for weeks, or, heaven forbid, talk to your colleagues.
And then things like this happen, and they fall down from the ivory tower, and we see that they behave like any other small start-up
GitLab exists.
Yeah it’s def connected. GH offers great value… but there’s a risk, as we’re seeing
(And very likely well within the SLAs between them and their customers. Yes, this is both snark and also not.)
Looks like it checks in 1-2 times a day; days with more than 2 commits are where GH has reported an incident. Some are minor, some are not.
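For illustration, a minimal sketch of that kind of check, assuming a local clone of the tracking repo (the clone path and the "more than 2 commits" threshold come from the comment above, not from any official tool):

```python
# Count commits per day in a local clone of the tracking repo; days with
# more than the usual 1-2 check-ins likely coincide with a reported incident.
import subprocess
from collections import Counter

def commits_per_day(repo_path: str) -> Counter:
    # One short date (YYYY-MM-DD) per commit, then tally them per day.
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--date=short", "--pretty=format:%ad"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line for line in out.splitlines() if line)

if __name__ == "__main__":
    counts = commits_per_day("github-status-tracker")  # hypothetical clone path
    for day, n in sorted(counts.items()):
        if n > 2:  # baseline is 1-2 check-ins per day
            print(f"{day}: {n} commits -- possible incident day")
```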
Yes it’s bad for critical infra… but perhaps most dev teams are incorrectly assuming they get high reliability on a cheap service with no strict SLAs? And need to pay up for high reliability?
https://bsky.app/profile/gergely.pragmaticengineer.com/post/3lfotgsq3tc2u
Less exposure than GitHub, but the same issue for any company relying on git-based deployment 😇. (Yes, I need to merge my PR in order to ship my code, and I can't!)
Teams need disaster response scenarios, so if GitHub is still down, they can still deploy, even if it's manual.
Mirror git locally. Or to GitLab. I worked in a place that didn't care if us-east-1 died, so did their customers. :-(
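A rough sketch of that mirroring idea (and of the disaster-response fallback from the comment above it), assuming a hypothetical GitLab remote URL and plain git driven from Python; not any official tooling:

```python
# Push a mirror of the local repo to a secondary remote so deploys can
# fall back to it if GitHub is unreachable.
import subprocess

MIRROR_URL = "git@gitlab.com:example/myrepo-mirror.git"  # assumption, not a real repo

def git(*args: str, repo: str = ".") -> None:
    subprocess.run(["git", "-C", repo, *args], check=True)

def push_mirror(repo: str = ".") -> None:
    # Add the secondary remote once (ignore the error if it already exists),
    # then push all refs so the mirror stays in sync.
    try:
        git("remote", "add", "mirror", MIRROR_URL, repo=repo)
    except subprocess.CalledProcessError:
        pass  # remote already configured
    git("push", "--mirror", "mirror", repo=repo)

if __name__ == "__main__":
    push_mirror()  # e.g. run from CI or a cron job after every merge
```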
Call me GenX but if it’s really that critical to your operation, you should figure out how to cover your own ass.
There’s very much a trade-off of availability vs. feature velocity and cost.
A load balancer config change meant GH bounced all requests.
For an hour.
Yes, incidents happen, but this is simply not how a mission-critical service should operate.
https://www.githubstatus.com/incidents/qd96yfgvmcf9
“not how a mission-critical service should operate” — you mean they shouldn’t make changes to their configs? I genuinely have no idea what this means.
You'd hope that'd be a 5-10 minute sort of thing instead.
In the note, GH seemed to agree that they want to make changes so detection + mitigation will not take this long.
My point was that an hour to catch a change that broke everything seems too long
I’ve been involved in incidents like that, and either a significant time delta or other subsequent changes since that change can make it hard to know which change to roll back.
Perhaps you cannot expect high uptime by default: you need to invest more to get it
https://bsky.app/profile/bigvalen.bsky.social/post/3lforx35pis2k