jacob.gold • 200 days ago

An unordered / non-exhaustive list of things that helped scale Bluesky's infra efficiently:

+ Exiting the cloud (colocation)
+ HAProxy w/many Node backends
+ Go w/clever code
+ ScyllaDB
+ SQLite w/per user databases
+ Redis w/many instances
+ AMD servers w/many cores
+ Purchasing bandwidth directly

Comments

mohsinhijazee.bsky.social•192 days ago

SQLite database per user? That's mind boggling. Curious to know more.
And amazing choices especially Go.

wqsiqrr.bsky.social•192 days ago

Also what sort of data is being stored there

jacob.gold•192 days ago

SQLite is being used by the PDS to store all of our posts and likes etc. It’s the most important stuff!

stephjmom.bsky.social•186 days ago

📌

rv3en5.bsky.social•199 days ago

📌

anarcher.dev•192 days ago

When you need to restart or replace a node that has PDS, what approach do you use to do it without downtime? :)

patrick-ohlson.ms-dot.net•193 days ago

📌

richburroughs.dev•193 days ago

📌

nuke.blue•178 days ago

Exiting the cloud and HAProxy are music to my ears, as someone who has felt increasingly trapped by cloud providers over the course of the last decade.

Gonna re-investigate this at work.

jmprusi.bsky.social•193 days ago

📌

digitalkaoz.net•195 days ago

HAProxy is my newly discovered favorite toy. I had the chance to use it at work again; damn, it's such a good piece of software ♥️

janaka.bsky.social•193 days ago

Can you (or anybody) share more about the DC setup? I'm generally curious how much is done in-house: renting empty racks, or a server build + rack + remote-hands service from a provider? How many geo locations? Which countries? What does the network interconnect look like?

jacob.gold•193 days ago

Sure, this is stuff that was mentioned previously: empty cages, redundant 100 Gbit/s circuits, and two PoPs (US west and east). It's not super complicated, just really powerful AMD machines with lots of RAM, 25G networking, and all-NVMe storage.

One novel-ish aspect is that there are no routers per se, just Linux boxes.

janaka.bsky.social•192 days ago

Thanks. Sorry, I did spot that previously. I guess these days CPUs and NICs are powerful enough, at commodity prices, to handle routing throughput.

daniskitchen.bsky.social•200 days ago

*nods knowingly without any understanding of any of that*

cforde.github.io•191 days ago

"SQLite w/per user databases"? Interesting. How does that help scalability? What are the downsides?

the-ballmer-peak.bsky.social•164 days ago

SQLite is really surprising to me.

jacob.gold•164 days ago

Works really well for the atproto PDS use-case because each user's repository is very much an entirely separate database.

justingarrison.com•193 days ago

📌

ahumbleman.bsky.social•199 days ago

📌 right forgot to pin

ahumbleman.bsky.social•199 days ago

Thank you mightily

rodrigok.me•192 days ago

SQLite per user is definitely an interesting surprise.

How do you deal with multiple instances reading/writing to it? How about replication or backup?

jacob.gold•192 days ago

SQLite has no issue with concurrency especially in WAL mode. Only one writer can write at a time but others can just wait their turn. Readers don’t block at all. It’s great!
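
A minimal sketch of the WAL behavior described above, using Python's standard `sqlite3` module. The file name, table, and rows are made up for illustration, and this is not the PDS code (the thread does not show it). It assumes the database file lives on a local filesystem.

```python
import os
import sqlite3
import tempfile

# Hypothetical path; any file on a local (non-network) filesystem works.
db_path = os.path.join(tempfile.mkdtemp(), "user.sqlite")

# isolation_level=None puts the connection in autocommit mode, so the explicit
# BEGIN/COMMIT statements below control transactions directly.
writer = sqlite3.connect(db_path, isolation_level=None)
journal_mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
writer.execute("INSERT INTO posts (body) VALUES ('first post')")

reader = sqlite3.connect(db_path, isolation_level=None)
# timeout=0 makes a blocked writer fail immediately instead of waiting its turn.
second_writer = sqlite3.connect(db_path, isolation_level=None, timeout=0)

# Take the write lock and leave the transaction open.
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO posts (body) VALUES ('second post')")

# Readers are not blocked: they see the last committed snapshot.
count_during_write = reader.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

# A second writer cannot start while the first holds the write lock.
try:
    second_writer.execute("BEGIN IMMEDIATE")
    second_writer_blocked = False
except sqlite3.OperationalError:
    second_writer_blocked = True

writer.execute("COMMIT")
count_after_commit = reader.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

print(journal_mode, count_during_write, second_writer_blocked, count_after_commit)

for conn in (reader, second_writer, writer):
    conn.close()
```

With a non-zero `timeout` (Python's default is 5 seconds), the second writer would wait for the lock instead of failing, which matches the "wait their turn" behavior.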

rodrigok.me•192 days ago

Sorry, I meant multiple networked instances that don’t share a local filesystem. I thought SQLite did not have a server mode? Or is my knowledge out-of-date?

jacob.gold•192 days ago

You’re right it’s an embedded database library not a client server database. Each user is on a single node so there’s no network component to the data layer. Which is why it’s so wildly efficient.

rodrigok.me•192 days ago

Gotcha! That’s interesting. Thanks for taking the time responding!

borja.dagi3d.net•182 days ago

Thanks for sharing. And what about things like schema migrations? How long does it take?

rassbariaw.bsky.social•190 days ago

🤔

Reads like ancient Greek for us non-technies

acab.dad•200 days ago

Exiting the cloud is key. It’s really expensive to keep those computers up so high in the atmosphere

mikeythedude.bsky.social•192 days ago

Not using any cloud resources (AWS, Azure, etc.)?

sszuecs.bsky.social•189 days ago

Node oh 😮, do you have more than 10 rps per instance? Do you do graphql?

vicnastea.io•200 days ago

I’m impressed by what y’all have built!

What is some of the cleverness in the Go code?

jaz.bsky.social•200 days ago

Generally leaning into helpful data structures to purpose-build scalable services like the Graph service - https://jazco.dev/2024/04/20/roaring-bitmaps/

Among many other things like Bloom filters, restructuring workloads to favor very high concurrency, etc.
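
For readers who have not met Bloom filters: a minimal sketch of the general technique, in Python. This is a textbook version, not Bluesky's Go code; the class name, the sizes, and the use of SHA-256 as the hash are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
for handle in ("alice.example", "bob.example", "carol.example"):
    seen.add(handle)

print(seen.might_contain("alice.example"))
```

The appeal is that a "definitely absent" answer costs a few bit lookups and no database query, at the price of occasional false positives.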

digitalkaoz.net•195 days ago

Nice read!

cursed.monster•191 days ago

Bloom filters are so neat! Relatively few uses, but for the ones they fit, they’re magical.

cursed.monster•191 days ago

And Roaring bitmaps *are* magical.

raphaeltm.com•193 days ago

Fun read. Thanks for sharing!

yedragg.itch.io•185 days ago

📌 🔧🪛⚙️🦾

vicnastea.io•199 days ago

Awesome stuff. Thanks for the pointer!

eribeiro.bsky.social•193 days ago

📌

manos.lol•197 days ago

📌

ollieshouse.bsky.social•193 days ago

@jacob.gold
What are you using to orchestrate server workloads behind HAproxy? (Eg, containers, k8s or just plain VMs/metal?)

jacob.gold•193 days ago

I set it up with just containers on bare metal, but it hardly matters. It's mostly a matter of preference; k8s or Nomad or whatever would work fine.

rafael.my•193 days ago

📌

dbgr.bsky.social•193 days ago

SQLite with per user database.

Curious, how do you do SQLite persistence?

janaka.bsky.social•193 days ago

https://github.com/bluesky-social/pds

janaka.bsky.social•193 days ago

It's in each user's PDS, which you can self-host by running a container, afaik. I'm guessing that means a direct-attached disk (?), as SQLite doesn't like NFS. There's a little detail about SQLite use here:

https://open.substack.com/pub/pragmaticengineer/p/bluesky?utm_source=share&utm_medium=android&r=ch6l7

dbgr.bsky.social•193 days ago

Thanks

ryan.crawcour.social•200 days ago

Sounds expensive. Who's paying for it all?

jacob.gold•200 days ago

Yeah, things are always relative! Bluesky Social, PBC (the company) has raised money. The protocol (atproto) and app built on it (Bluesky) are fully open networks that anyone can build/participate in.

ryan.crawcour.social•200 days ago

Someone is gonna wanna be paid. Unfortunately that's how capitalism works.

jacob.gold•199 days ago

Sure, but there are win-win scenarios, and the protocol and network were designed to avoid this being a problem by being “locked open” in case the company that launched it ever became evil. Doesn’t mean it’s not a valid concern though!

elchefe.me•193 days ago

This looks like a fairly cost-optimized platform, tbh

aparker.io•193 days ago

honestly a testament to their SRE that it’s not falling over with those constraints and the kind of growth it’s seeing

ryan.crawcour.social•193 days ago

Sure, but someone still has to pay for it at some stage. Those VCs are gonna want their payback real soon.

auggie.dev•193 days ago

📌

walkingalchemy.com•193 days ago

📌

maeddes.bsky.social•192 days ago

Are there any public architecture kind of diagrams available?

jdd.me.uk•193 days ago

People forget cloud was originally supposed to be like a contractor: expensive, but there when you need extra capacity. Now companies do everything in the cloud. For compute that you will definitely need 365 days a year, you should consider colocation.

jacob.gold•193 days ago

It’s fine to host everything in the cloud but there are thresholds where it becomes prohibitive. Some companies make so much money it doesn’t matter but for Bluesky it would have been a real problem.

jdd.me.uk•193 days ago

I have definitely seen companies that make a lot of money whose attitude is just: who can be bothered to look after a collection of colocated servers? I'd rather pay for cloud. There is some truth to that, but for RAM- or CPU-heavy stuff colo can be a saving. Sysadmin salaries are also a factor.

iamstan.elmo.sh•199 days ago

Experimented with Valkey over Redis yet? Apparently new versions of Valkey are significantly more efficient than Redis.

jacob.gold•199 days ago

Looked at it but not used it, I assume it's still single threaded for operations though?

That's the main limitation with Redis on a machine with many cores, you can't really leverage them!

So Bluesky uses sharding and a large number of Redis instances on a single physical host to utilize the cores.
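
A rough sketch of what client-side sharding across many Redis instances on one host can look like, in Python. The thread does not say how Bluesky maps keys to instances, so everything here is an assumption: the hash-then-modulo scheme, the host, the port layout, the instance count, and the key format. No Redis client is used; the function only picks the instance a key would go to.

```python
import hashlib
from collections import Counter

HOST = "127.0.0.1"     # hypothetical: all instances run on one physical host
BASE_PORT = 6379       # hypothetical: instance i listens on BASE_PORT + i
NUM_INSTANCES = 96     # hypothetical: e.g. one instance per core


def shard_for(key: str) -> tuple[str, int]:
    """Return the (host, port) of the instance responsible for a key."""
    # A stable hash keeps the mapping identical across processes and restarts
    # (Python's built-in hash() is randomized per process, so it is avoided).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % NUM_INSTANCES
    return HOST, BASE_PORT + index


# Spread some made-up keys and count how many land on each instance.
keys = [f"user:{i}:timeline" for i in range(10_000)]
load = Counter(shard_for(key)[1] for key in keys)

print(len(load), min(load.values()), max(load.values()))
```

Plain modulo sharding reshuffles most keys when the instance count changes; that is usually acceptable for a cache and is why consistent hashing exists for cases where it is not.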

iamstan.elmo.sh•199 days ago

They've improved concurrency and memory utilization. Claim almost 4x improvement on queries per second.

https://valkey.io/blog/valkey-8-0-0-rc1/

jacob.gold•199 days ago

Thanks for the info, good to know. I definitely had "check out Valkey next time I need Redis" on my list. Being fundamentally single threaded (a single "main thread") still means you'd want to run many instances and use sharding to utilize the CPUs on e.g. a 96 core system.

trvrm.bsky.social•173 days ago

Do you happen to know if these efficiency gains include streams? I've found redis streams an extremely valuable tool.

iurii.net•191 days ago

I’m so happy SQLite is coming back to spotlight, especially in high-load scenarios.
Easily one of the best storage engines out there.
Did you know it even supports window functions and CTEs?

rswestmore.land•200 days ago

Are you running into lock issues with sqlite, or is everything cached in front of it and only committed for repopulating the cache?

jaz.bsky.social•200 days ago

The SQLite databases are one-per-user on the write path, and there isn't really much single-user concurrency when it comes to writes, so we haven't seen any issues there at all. The PDSs themselves have a shared sequencer SQLite database, but it's in WAL mode and the writes are small; each PDS only does <50 writes/sec.

rswestmore.land•200 days ago

Interesting, I might have to give this a try.

Most of my experience is based on wazuh's design, it now uses per-agent sqlite, and frequently runs into problems. But it's very write intensive.

I've written my own ledgers based on ext4 directly, when mariadb is overkill, but time to give sqlite a go

jaz.bsky.social•200 days ago

Check out https://phiresky.github.io/blog/2020/sqlite-performance-tuning/

More DBs is always better if you don't _need_ to have all your data in one sqlite database file. If you have multiple tables and don't have to run transactions modifying both of them, use more than one sqlite file each with their own writer!
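
A small sketch of the "more files, more writers" point, in Python with the standard `sqlite3` module. The one-file-per-user naming, the table, and the user names are hypothetical; the PDS's real layout is not shown in this thread. The write lock is per database file, so a write transaction held open on one file does not block a writer on another.

```python
import os
import sqlite3
import tempfile

data_dir = tempfile.mkdtemp()  # stand-in for a local data directory


def open_user_db(user_id: str) -> sqlite3.Connection:
    """Open (and initialize) the database file for one user."""
    path = os.path.join(data_dir, f"{user_id}.sqlite")  # hypothetical naming
    # timeout=0: fail immediately on lock contention instead of waiting,
    # so any cross-file blocking would show up as an exception.
    conn = sqlite3.connect(path, isolation_level=None, timeout=0)
    conn.execute("PRAGMA journal_mode=WAL").fetchone()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, body TEXT)"
    )
    return conn


alice = open_user_db("alice")
bob = open_user_db("bob")

# Hold a write transaction open on alice's file...
alice.execute("BEGIN IMMEDIATE")
alice.execute("INSERT INTO records (body) VALUES ('post from alice')")

# ...and write to bob's file at the same time; the two locks are independent.
bob.execute("BEGIN IMMEDIATE")
bob.execute("INSERT INTO records (body) VALUES ('like from bob')")
bob.execute("COMMIT")

alice.execute("COMMIT")

counts = {
    name: conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    for name, conn in (("alice", alice), ("bob", bob))
}
print(counts)
```

The trade-off is the one raised elsewhere in the thread: nothing can be joined or updated atomically across files, so cross-user views have to be built somewhere else.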

suresh.dev•200 days ago

Doesn't the per-user DB make aggregation a lot more expensive and difficult to scale? For any information that needs to join multiple users' DBs (say, trending messages), how does that aggregation work, and how do you make sure it's up to date? Sorry, I am new to AT and Bluesky; please excuse my noob questions.

jaz.bsky.social•200 days ago

Check out this article, it explains the architecture pretty well:

https://atproto.com/articles/atproto-for-distsys-engineers

typonomy.bsky.social•199 days ago

Wow, great article that explains the architecture incredibly well with minimal technical language. Gotta love the illustrations as well

richschu.bsky.social•191 days ago

Do you have any blogs/posts/etc that go into more detail? Regardless, thanks for sharing these with fellow infra nerds.

rswestmore.land•191 days ago

Jaz noted one here https://bsky.app/profile/jaz.bsky.social/post/3lanhq352jc2e

richschu.bsky.social•191 days ago

Thank you!

anitramwaju.malauren.be•200 days ago

But isn't that what makes it difficult, for now, to let users migrate back to Bluesky PDSs?

jaz.bsky.social•200 days ago

No that's just the CAR library we're using that says it's async but isn't actually async and locks up the main thread of a PDS container. Nothing to do with the SQLite stuff, just a package we probably need to rewrite ourselves or fork.

tom.sherman.is•199 days ago

Yikes, do you have a GitHub issue link for this?

iamavieira.com•199 days ago

📌

chris.blue•193 days ago

What benefit did exiting the cloud have?

jik.wtf•10 days ago

$$$

jik.wtf•10 days ago

oops, sorry for necro

yunghollow.bsky.social•199 days ago

per user database? That can't be right?

jacob.gold•199 days ago

Yup, there are almost fifteen million SQLite databases on Bluesky’s PDS servers. It’s wildly efficient and simple but not without trade offs of course.

Makes sense for this use case in large part because each user's atproto repository is self contained, with links to other repos, like a website.

jacob.gold•199 days ago

Oh I should add that the data is aggregated on a separate service (the AppView) for the app to consume in a friendly and high performance way.

On that service it’s stored in ScyllaDB, which is on the other extreme of database systems 😏

bethcodes.bsky.social•193 days ago

That’s also convenient for db-inclusive unit testing 😍

mikecherry.bsky.social•200 days ago

Well done, sir. Well done.

amuzi53.bsky.social•200 days ago

Ready to support on language access.

gungle.bsky.social•150 days ago

I really don't know what these are but they sound important so I'm glad these things were implemented

mattmcintire.com•196 days ago

Love seeing the comment about AMD servers with many cores. Any issues with memory bandwidth on your servers - or are you largely memory capacity and CPU bound?

nican.net•200 days ago

1. Are you paying for ScyllaDB?

2. How are you storing the SQLite files? On Scylla? Or NFS? Hosted on top of what file storage solution?

lykron.bsky.social•200 days ago

ScyllaDB has fairly extensive docs on what storage it likes.

Spoiler: the faster and closer to the PCIe bus, the better.

lykron.bsky.social•200 days ago

And you wouldn’t want to run it over network storage.

ScyllaDB can cluster at the application level, so local storage all the way

jacob.gold•200 days ago

1. (not on the team anymore, so won't answer for them)

2. SQLite is used on the PDS, stored on local NVMe SSDs. Each user (each of us) has their own SQLite database for their atproto repository data.

Here's the original PR where @dholms.xyz and @divy.zone did this work: https://github.com/bluesky-social/atproto/pull/1705

danielhe4rt.dev•200 days ago

AFAIK they’re using the ScyllaDB OSS

(I’m not on the team as well)

jaz.bsky.social•200 days ago

Yeah, we're using FOSS Scylla right now, not enterprise.

The SQLites are just stored on disk for the PDSs (not in the same place as where Scylla runs) and have backups plus litestream on the PDS-wide DBs which allow us to recover from outages by playing back the event stream for the PDS.

nuclearpidgeon.bsky.social•189 days ago

Is there any sharding or consistent hashing involved in how the PDS SQLite files are distributed across the servers? Or are accounts just put on a particular server during signup and then that one server is always queried for your account/post data?

jaz.bsky.social•189 days ago

For now accounts get assigned a PDS and stay there. We eventually want ways to load balance users across PDSs based on activity/locality etc. but we're not there yet.

nickdodd.com•200 days ago

Nice

hicksca.dev•193 days ago

The SQLite is the only surprising part. Would love to learn more about the hows and whys behind the SQLite decision.

dane.computer•193 days ago

Ditto, I've heard about the SQLite db per user strategy from a friend that works at Cloudflare, but I'm not sure how it works in practice. Super curious

r-s-s.bsky.social•196 days ago

Why can't Bluesky preserve a blank line inside a post?

jacob.gold•196 days ago

Is

that

true?

r-s-s.bsky.social•196 days ago

Try it.

Write the same reply but try to copy the whole text before you post it.

If you paste it into another new post, the blank lines disappear.

jacob.gold•196 days ago

Pretty sure that happens on Twitter too. I agree it's annoying. Something to do with the browser copy/paste behavior? Anyway, not something I'd know about and I'm no longer on the team so I can't post it on Slack 😉

r-s-s.bsky.social•196 days ago

I know Twitter does it too, but it didn't use to.

I like to copy a reply and repost it in other similar threads.

It's no big deal. I can reinsert the blank line. It's just annoying.

You may be right about the browser being the culprit. I use firefox.

jacob.gold•196 days ago

Got me curious, so I looked into it a little. It's all related to the fact that these text input boxes aren't really normal browser HTML forms. They're content editable divs with semi-complex code operating them. So they just don't behave like you'd expect.

Probably will get fixed one day!

t1c.dev•196 days ago

Give DragonflyDB a look instead of Redis/Valkey, I've found it to be way faster and lighter weight than Redis in my past work!

dragonflydbio.bsky.social•184 days ago

Thanks for the shoutout!

t1c.dev•184 days ago

Glad to see you here! Btw, to verify yourself, you should set your handle to @dragonflydb.io

doctorjimmy.bsky.social•191 days ago

I have zero idea what any of those things are. 😂

rigby.sh•199 days ago

I'm curious (if you can answer) how are likes handled with the individual SQLite databases? Specifically the aggregation count. I'm assuming that the count is really just the sum of the events downstream from the firehose and my likes are only stored on my PDS, but is there something I'm missing?

jacob.gold•199 days ago

Aggregation is done by the App View service, basically it "crawls" all of the "app" data on everyone's PDS and creates aggregate "views" (like counts, etc).

https://atproto.com/articles/atproto-for-distsys-engineers
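
To make the aggregation idea concrete, here is a toy sketch in Python. The event shape (`actor`, `subject`, `action`) and the sample data are invented for illustration; they are not the real atproto record or firehose format, and this is not how the App View is actually implemented. It only shows the principle: each like lives in the liker's own repo, and a downstream service folds create/delete events from many repos into per-post counts.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LikeEvent:
    actor: str    # user whose repo contains the like record
    subject: str  # post being liked
    action: str   # "create" or "delete"


def aggregate_like_counts(events) -> Counter:
    """Fold a stream of like events into a like count per post."""
    active = set()  # (actor, subject) pairs currently in effect
    for event in events:
        pair = (event.actor, event.subject)
        if event.action == "create":
            active.add(pair)       # a repeated create is counted once
        elif event.action == "delete":
            active.discard(pair)   # deleting an unknown like is a no-op
    return Counter(subject for _, subject in active)


events = [
    LikeEvent("alice", "post:1", "create"),
    LikeEvent("bob", "post:1", "create"),
    LikeEvent("bob", "post:2", "create"),
    LikeEvent("alice", "post:1", "delete"),
    LikeEvent("carol", "post:1", "create"),
]

counts = aggregate_like_counts(events)
print(dict(counts))
```

A real aggregator would persist these counts (the thread mentions ScyllaDB on the AppView side) rather than hold them in memory.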
