"Is the fediverse about to get Fryed? (Or, 'Why every toot is also a potential denial of service attack')" by @aral https://ar.al/2022/11/09/is-the-fediverse-about-to-get-fryed-or-why-every-toot-is-also-a-potential-denial-of-service-attack/
Great summary of the problem, although I don't think there is anything malicious behind these performance issues. My hunch (read: wild-ass guess) is that performance has just not been a huge concern for Mastodon until now, and so now we're feeling the effects. Also: it's hard to measure/predict perf in a federated network.
The perf issues are bad enough that Wired is writing about it: https://www.wired.com/story/twitter-users-mastodon-meltdown/
And Mastodon admins are having to get creative: https://blog.freeradical.zone/post/surviving-thriving-through-2022-11-05-meltdown/
Here on toot.cafe, Sidekiq queues have been backed up for days, and I will probably not see your response to this toot until tomorrow. 😅 And that's despite the fact that 1) I bumped the EC2 instance side from t3.medium to t3.xlarge, 2) I added a 2nd Sidekiq process, and 3) registrations have been closed for years. (Just returning users!)
I would be really curious to read an in-depth perf analysis from someone who knows what they're talking about (read: not me) explaining the root cause of the "Mastodon meltdown." E.g. is it:
1) A flaw in how Mastodon prioritizes Sidekiq queues?
2) As instance X gets slow, it slows down instance Y when instance Y tries to contact it? (I.e. is it systemic?)
3) Would exponential backoff help? More relay servers? Admins just learning how to tune Sidekiq properly? Rewriting Mastodon in Rust? 🙃
So, I have never used Sidekiq before but Queues of various kinds, distributed systems and their performance issues used to be my bread and butter at my old job.
From the aral post, I think the most important part is to listen to what the server admin is saying about how the problem occurs:
> (you have 23k followers, let’s assume 3k different servers), as soon as you create the post 3k Sidekiq jobs are created. At your current plan you have 12 Sidekiq threads, so to process 3k jobs it will take a while because it can only deal with 12 at a time.
Right away I am already seeing architectural red flags here. If 12 different other servers that I am trying to push updates to are all currently overwhelmed and timing out, it sounds like the entire process will grind to a halt. But there's reason for that; it's not the server who has a lot of push work to do can't do it all in time!!
The problem is that the logic that the server uses to push updates is highly flawed. First of all, an arbitrary limit of 12 in-flight requests max is pretty damn low. I would adjust that up to 1000 or higher. A raspberry pi can handle 1000s of concurrent http requests no problem. HOWEVER I suspect that "worker count" of 12 or whatever it is, is ridiculously low for some reason -- It sounds like each worker is its own OS thread, so spawning 1000s of them could do terrible things to the poor CPU, cause it to spend way too much time context-switching between OS threads.
This sounds like oldschool 1-thread-per-request concurrency. Still works fine in 2022 as long as the # of things happening at once is close to the # of processor cores you have. But when you start having (or wanting to have, for performance reasons) 1000s of things going on at once, you start looking at implementing asynchronous IO properly.
couroutines, goroutines, greenlets, async tasks, lightweight threads, fibers, actors, event loop, epoll, async io, they go by many different names depending on the language du jour. But they are all ways of having 1 single OS thread doing multiple things at once without triggering OS-level context switches.
Key takeaways so far:
--------------------------------------
The problem is that the code is using 90s concurrency technology
Buying a faster computer wont really fix the problem, although it might ameliorate it a bit
Properly fixing the problem involves refactoring the code so that it can do 1000s of things at once.
Note that doing 1000 things at once is totally normal for modern applications and it does not require a fast computer.
> The problem is that the code is using 90s concurrency technology
Think oldschool apache 1.0 or even older, inetd, versus a modern application like nginx.
The former uses 1 operating system thread per request, while the latter uses epoll / asychronous io / event loops to handle thousands of concurrent requests all on the same OS thread.
This might be a bit of an inflammatory statement but really, Literally ALL the high-scale "fast" web tech uses the multiple-concurrent-things-on-same-thread design. So the fediverse software just has to adopt that if its going to "scale".
Again, It's not a matter of buying a faster computer or using more energy to run the service, it's just a way of having your computer do 1000s of things at once (or even millions depending on the system) without the computer getting bogged down switching between tasks like it does with OS threads.
----------------
Hugo Gameiro of masto.host goes on:
> Then, for each reply you receive to that post, 3K jobs are created, so your followers can see that reply without leaving their server or looking at your profile. Then you reply to the reply you got, another 3K jobs are created and so on.
Another thing I'm noticing here: Really there should be 1 queue for each server that you federate with, or the queue should be somehow "partitioned" on the remote server URL. Instead of spawning a new task for each individual message to a server, maybe it should just build up the events destined for a specific server & then send them all at once in a batch. During normal operation when its not backed up, the batches will only have 1 message, but when it gets backed up, batching could help dramatically. If there are 100 events queued up for the average server, then batching would make the consumption of the queue 100 times faster. IDK if ActivityPub supports this kind of batching tho.
At any rate I've never used Ruby, I could be wrong about a lot of the stuff im saying about OS threads etc with how that relates to Ruby code, but at the end of the day, that 12 number is sticking out like a sore thumb. I'm very very confident that the "actual solution" will be to remove the ceiling on the # of concurrent operations. We have the technology; it's just software. No need to buy a new server.
@shadowfacts @forestjohnson This sounds like Mastodon "relay servers." Which seem deprecated or something? I can't seem to find any info about them. https://discourse.joinmastodon.org/t/where-are-the-relays/1320/25
Sorry about my super long winded, poorly written posts before, the sort of silly analogy I can make goes like this:
In the nation of Mastodon Server #42069, there is a postal service. All outgoing mail goes to the postal outbox. Postal workers drive to the outbox, pick up One (1) message, then read the URL address on it and drive to that server to deliver it. Then they drive back to the outbox & repeat. There are only 12 postal workers. Problem is that when folks follower counts start rising they start getting followers from thousands of different servers, and every time they do anything, it puts thousands of messages in that outbox. Poor 12 workers can't deliver them all 1 at a time even if they were superhuman HTTP client machines.
My opinion: remove the limit on # of postal workers completely, or, if you absolutely can't do that, then try to make more than 1 message per mail truck.
@nolan @shadowfacts
But the "more than 1 message per mail truck" sounds like its not in the protocol. So I think fibers is the only way for mastodon.