Transcript
[music] Thank you. [applause] So today I will be talking about how, at WhatsApp, we've been optimizing the BEAM's scheduler for some very large machines, meaning more than 100 cores. I will start by explaining how we found the bottleneck in the first place and the path to investigating it. Then there will be a little bit of background, just as a reminder of what work stealing is and how the scheduler works, and then there will be the actual meat of the talk: the optimizations we did and the results we got.

So first, why did we start working on this? We did not just wake up one day and think, let's work on the scheduler. Instead, what happened is that we got some new servers that were pretty large, and the performance was not what we expected from their specification, or at least not what we expected when running Erlang code, which pointed at some kind of issue with the BEAM. One thing we observed when trying to run services in production on those servers is that we hit some bottleneck where the CPU utilization was still pretty low, and it was the scheduler utilization that was hitting 100% first.

So what are these two metrics? CPU utilization is given to us by the kernel; roughly speaking, it is the fraction of time the CPUs are busy doing useful work, executing either the BEAM or some other process. Scheduler utilization, on the other hand, is provided to us by OTP, and it is the sum of the time spent usefully executing code in the BEAM plus the time spent waiting for some lock to become available somewhere in the runtime. And since we're basically only running the BEAM, if there is a big difference between these two, the only possible explanation is that we're spending lots of time waiting for locks.
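As a back-of-the-envelope illustration of the diagnostic, here is a toy Python model (not how the kernel or OTP actually measure anything) showing how the gap between the two metrics is exactly the lock-wait fraction:

```python
# Toy model of the two utilization metrics described in the talk.
# busy: time executing useful work; lock_wait: time blocked on runtime locks;
# idle: time the scheduler slept with nothing to do. Arbitrary time units.

def cpu_utilization(busy, lock_wait, idle):
    # The kernel counts a thread blocked on a lock as not busy
    # (simplification: assume lock waits are sleeps, not spins).
    return busy / (busy + lock_wait + idle)

def scheduler_utilization(busy, lock_wait, idle):
    # OTP counts time spent waiting for locks as "utilized" scheduler time.
    return (busy + lock_wait) / (busy + lock_wait + idle)

# 100% scheduler utilization but only 30% CPU utilization: the missing
# 70% is time spent waiting on locks, which is the symptom the talk describes.
print(cpu_utilization(30, 70, 0))        # 0.3
print(scheduler_utilization(30, 70, 0))  # 1.0
```

When the two numbers track each other, the machine is simply busy; when scheduler utilization saturates while CPU utilization lags, the difference is lock contention.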
So this is one important thing to remember: whenever you see very high scheduler utilization and CPU utilization is not keeping up, it means you're stuck waiting for some locks.

Okay, so we could just grep for mutexes in the code of the BEAM, and we'd find that there are thousands of them, all over the place. That's not very useful by itself. Luckily, the BEAM comes with a very useful tool called lock counting (`lcnt`). You have to compile the VM in a special mode and then run it with the right flags. It's quite slow, with a 20 or 30% overhead, but it comes with a huge benefit: every single time we take and release a lock, we log a bunch of stats, and there are four functions for accessing those stats. First there is `lcnt:collect()`, which starts the collection of the stats. There is `lcnt:clear()` to reset everything and forget everything we've seen. Then there is `lcnt:conflicts()`, which gives us a very nice summary, and `lcnt:inspect()` to dive deeper.

So what does conflicts give us in this case? We get something like this. Each row corresponds to a kind of lock. Here we see from the first row that we have some locks called run_q, and that there are 254 locks with that tag. We see that over the course of this run, which was about one minute, we tried taking that lock something like 15 billion times, if I got the number of zeros right. We see that 27% of the time when we tried to take the lock, it was already taken by someone else, so we had to wait, and we see that in total we waited something like 13 seconds, or 131 seconds, I'm not sure of the number of digits. The last column is telling us that for every microsecond we spent executing code protected by that lock, we spent 974 waiting for the lock to become available. Pretty much anything above 1,000 in that duration ratio is a huge bottleneck.
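To make those columns concrete, here is a small Python sketch that derives the two ratios from raw per-lock counters (the field names and numbers are illustrative, chosen to roughly match the figures quoted in the talk, not lcnt's actual record format):

```python
# Derive lcnt-style summary ratios from raw per-lock counters.
# Field names are illustrative, not lcnt's actual output format.

def lock_summary(tries, collisions, wait_time_us, held_time_us):
    return {
        # Fraction of acquisition attempts that found the lock already held.
        "collide_pct": 100.0 * collisions / tries,
        # Time spent waiting per unit of time actually holding the lock;
        # values far above 1 mean waiting dominates the useful work.
        "duration_ratio": wait_time_us / held_time_us,
    }

s = lock_summary(tries=15_000_000_000, collisions=4_050_000_000,
                 wait_time_us=131_000_000, held_time_us=134_500)
print(s["collide_pct"])     # 27.0
print(s["duration_ratio"])  # ~974
```

A duration ratio of 974 means the lock was essentially serving as a queue: almost a thousand microseconds of waiting bought one microsecond of protected work.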
It means there is a big issue: we're spending way more time waiting for the lock than using it. And here we see that two of these kinds of locks are bottlenecks: run_q and the process table. Today I will only talk about the first one, but we've also fixed the second one.

Okay, so the run queue lock is the main lock used by the scheduler; that's why we had to look at it. But once again, we could grep for the run queue lock and find that it's used in hundreds of places throughout the scheduler. So what can we do? That's where we can use `lcnt:inspect()`. Here, for example, we are asking for more information about all of the times we took the lock called run_q. The combine option says that I don't want to separate by instance of that lock: if there are multiple locks with that name, I want all of them aggregated together. But the locations option says that I want the information split by each line in the BEAM that touches that lock, to better find out which specific lines are the bottleneck. If we do that with nothing else, we get a very useless result telling us that we only touch that lock in one place, which is a small wrapper pair called runq_lock and runq_unlock. So to get actually useful information, we must inline that wrapper everywhere. Then suddenly we get information just like on the previous slide, but now with one row for each line of code in the BEAM that takes the lock. And that's a big win, because we see that there is a huge bottleneck in one place: a function called try_steal_task.

So at this point, before I can say what we did to improve it, I must first explain what work stealing is. It was already mentioned yesterday in quite a bit more depth, but I will still give a quick reminder. The scheduler is responsible for deciding which process to run at what time on what core.
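The inlining point can be illustrated with a toy Python sketch (the wrapper name comes from the talk; everything else here is hypothetical): when every acquisition goes through one wrapper, per-line stats attribute all of them to the wrapper's own source line, hiding the real call sites.

```python
# Toy illustration of why per-location lock stats are useless when all
# acquisitions funnel through one wrapper function: every acquisition is
# attributed to the wrapper's own source line, not the real caller's.
import inspect
from collections import Counter

acquisitions = Counter()

def runq_lock():  # stand-in for the BEAM's small lock wrapper
    # Stats keyed on the currently executing source line: always this one
    # line, no matter which function called the wrapper.
    acquisitions[inspect.currentframe().f_lineno] += 1

def scheduler_loop():   # hypothetical call site A
    runq_lock()

def try_steal_task():   # hypothetical call site B, the real hot spot
    runq_lock()
    runq_lock()

scheduler_loop(); try_steal_task()
# Three acquisitions, but only one distinct "location" in the stats.
print(len(acquisitions), sum(acquisitions.values()))  # 1 3
```

Inlining the wrapper at each call site is what lets the accounting key on the genuinely distinct source lines instead.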
In practice, if we ignore dirty schedulers, there's just one pinned scheduler thread on each core, and each one has a linked list of processes; it goes through them one by one and executes each one for roughly one millisecond. Obviously, we don't want one core to only work on one or two processes while another core has, say, 10,000. So once in a while, but quite rarely, there is a global rebalancing.

So what happens when a scheduler runs out of work and there is no rebalancing going on, so it doesn't know what to do next? What it does is work stealing, meaning that it will somewhat randomly pick some other scheduler thread and ask: do you have work that you're ready to do? If you do, then let me take a process from you, and I will execute it instead of you. And obviously, if it cannot find anything, because that other scheduler is also out of work, it will just try the next scheduler, and the next, and the next, and it will keep going until it finds a scheduler with work to steal. Since multiple schedulers can be touching the same run queue, we need a lock to protect that run queue, that linked list of processes that are ready to run, and it was this lock that was the bottleneck.

What was happening is that you would have one thread with 10,000 processes ready to run, and 100 threads with no process to run. All 100 would be trying to steal from each of the other empty threads before hitting the one with 10,000. Each time, they would all try to take the same lock just to see: oh, there's no work for me here; then they would all try the next thread, and so on. So what can we do to improve this? There are quite a few things.
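The quadratic blow-up in that pattern can be counted with a toy Python model (a simulation of the probing pattern described above, not the BEAM's actual stealing code): one run queue has all the work, the rest are empty, and every idle scheduler locks victims one by one just to find them empty.

```python
# Toy model of the contention pattern: one scheduler has all the work,
# and every idle scheduler probes victims sequentially, taking each
# victim's run-queue lock just to peek at an empty queue.

def naive_steal_probes(n_schedulers, busy_scheduler=0):
    total_lock_acquisitions = 0
    for thief in range(n_schedulers):
        if thief == busy_scheduler:
            continue  # this scheduler has work; it doesn't steal
        for victim in range(n_schedulers):
            if victim == thief:
                continue
            total_lock_acquisitions += 1  # lock victim's run queue to peek
            if victim == busy_scheduler:
                break  # found work; stop probing
    return total_lock_acquisitions

# With 100 schedulers and all the work sitting on the last one, each of the
# 99 idle schedulers locks 99 queues in one probing round: roughly N^2 total.
print(naive_steal_probes(100, busy_scheduler=99))  # 9801
```

And since each idle scheduler loops like this continuously until it finds work, the empty run-queue locks get hammered far more often than the one lock that actually guards work.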
For example, one small optimization: previously, whenever we would steal work, we would first take work from another thread, put it into our own run queue, then go back to the beginning of the overall scheduler loop, where we would check: do I have any work in my run queue? Oh yes, I have work, I just put it there. Obviously that means a lot of taking a lock, releasing a lock, taking a lock, releasing a lock, and so on. One simple thing we can do is say: if I just stole work from someone else, instead of putting it into my own run queue and then checking that my run queue is not empty, I can just execute it directly. This does not directly help with the main bottleneck, which is the victim's lock, but it still means we're taking locks less often, so there is a bit less contention.

The second optimization is that before, we would always steal a single process. That means that if one core had 10,000 processes and we have zero, we would steal one, execute it for a bit, and then, unless it created new processes, whenever that process ends we would be back to having zero processes, back to probing all the schedulers to find out which one has lots of work, stealing another process, executing it, and so on. One thing that is very common in work-stealing schedulers is that whenever you steal, you don't steal just one task; you steal a batch, so that it lasts you a lot longer and you won't have to do work stealing again so soon. The general idea is to steal half. I had to put a limit at 100, because otherwise the work stealing itself takes very long and we end up holding the lock for too long. But still, this reduces the number of lock acquisitions quite a bit.
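The "steal half, capped" policy can be sketched in a few lines of Python (the cap of 100 matches the talk; the function shape is mine, not the BEAM's C code):

```python
# Sketch of the batch-stealing policy described above: take half of the
# victim's runnable processes, capped at 100, so one steal lasts a while
# without holding the victim's lock for too long.
from collections import deque

STEAL_CAP = 100

def steal_batch(victim_queue: deque) -> deque:
    """Remove up to half of the victim's tasks, capped at STEAL_CAP."""
    n = min(len(victim_queue) // 2, STEAL_CAP)
    if n == 0 and victim_queue:
        n = 1  # always take at least one task if any work exists
    return deque(victim_queue.popleft() for _ in range(n))

victim = deque(range(10_000))
batch = steal_batch(victim)
print(len(batch), len(victim))  # 100 9900
```

Stealing half (rather than, say, a fixed count) keeps the split proportional when queues are short, while the cap bounds how long the victim's lock is held during the transfer.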
The third optimization, which is a bit weirder but which in practice had the most impact: whenever we try to steal from another thread, instead of calling lock, and thus waiting if someone already holds it, we call trylock. This says: if no one currently holds that lock, I want it; but if someone is already modifying that run queue, don't bother waiting for them to be done, just let me keep trying to steal, but from someone else. So we keep trying until we find someone that no one else is currently stealing from and no one else is currently pushing work to. What helped even more was making it a kind of first-in-first-out queue, such that if we exhaust all of the scheduler threads and all of them are busy, so we can't steal from any of them, we just start again from the beginning and keep looping until we find one we can safely steal from without having to pause and wait.

And there are a few smaller things. For example, we don't need to check at every single step whether the total amount of work has fallen below some threshold, or whether the whole VM is shutting down: if we're trying to steal from 100 schedulers, we can check once at the beginning and once at the end. We can also move fields around in structs so that fields that are often used together share the same cache line. But the biggest wins were the three I showed before, and especially the third one.

So how do we actually evaluate this kind of optimization? Sadly, we don't have a perfect and representative benchmark.
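The try-lock scan can be sketched with Python's threading primitives (illustrative only; the BEAM does this in C on its own run-queue locks): a victim whose lock is contended is skipped rather than waited on.

```python
# Sketch of the try-lock stealing loop: skip any victim whose run-queue
# lock is already held, instead of blocking on it. Illustrative Python,
# not the BEAM's actual C implementation.
import threading
from collections import deque

class RunQueue:
    def __init__(self, tasks=()):
        self.lock = threading.Lock()
        self.tasks = deque(tasks)

def try_steal(victims):
    """Scan victims in order; only touch run queues we can lock instantly."""
    for rq in victims:
        if not rq.lock.acquire(blocking=False):
            continue  # someone is already touching this run queue: move on
        try:
            if rq.tasks:
                return rq.tasks.popleft()
        finally:
            rq.lock.release()
    return None  # every victim was empty or contended on this pass

a, b = RunQueue(), RunQueue(["task1"])
a.lock.acquire()          # simulate another scheduler holding a's lock
stolen = try_steal([a, b])
print(stolen)             # skips a entirely, steals "task1" from b
```

The payoff is that idle schedulers stop queueing up behind each other on the same contended lock; in the real fix this scan loops over the victims FIFO-style until an uncontended, non-empty queue turns up.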
What I had to do was get some groups of machines in prod, modify some of them to run the new version and some to keep running the old version, then make sure I sent the same amount of load to both sets of machines and graph what happens to CPU utilization, scheduler utilization, and so on. That gave a bunch of graphs like this one. That's a machine that's pretty big, but still not that big. What we see is that here we have six machines: three with the optimizations, three without. At low to moderate amounts of load there is a huge difference, almost double (well, not quite: 25% versus 40% scheduler utilization). Then, as the load keeps rising, the gap shrinks, which is not very surprising, because the more load there is, the rarer it is that schedulers run out of work and have to resort to work stealing. But it does not completely disappear even at very high load. There's quite a bit of noise here, so it's hard to see clearly, but we can zoom in and see that there is still a difference of a couple of percent. Same deal on a larger machine: zooming in, we see that on that machine, which has more than 100 cores, there is always a difference of about 5% at high load between the machines with and without the optimizations.

You will have to trust me on this, but these are not cherry-picked graphs: they've been repeated on a lot of machines, on multiple services in prod, and we always see the same kind of results. So at low load, a very big win, but that's not really what we care most about. At high load, however, which is what we consider the actual bottleneck, we're seeing a 2% win on small machines (by small I mean dozens of cores) and a 5% win on very large machines, meaning more than 100 cores. So that's good for WhatsApp workloads. What about other workloads?
It's hard for us to be sure, but there is one comment on the first PR that implemented these changes, from someone in the community, who said that on his own workloads he saw an 11% throughput improvement. He was nice enough to also do some lock-counting experiments, showing 4.5 times fewer lock collisions and more than 10 times less time spent acquiring the locks.

So how can you do your own tests? About two-thirds of these optimizations were merged back in April; the last third was merged a few weeks ago. So if you are testing on the latest version of OTP, you should have all of them. You can also quite easily cherry-pick the PRs if you want to check just them and see whether you're also seeing a win. If you see either a win or a loss of performance, I would love to hear about it, discuss it with you, and try to see if there is anything we can do.

Finally, I will discuss a few more ideas which I did not push all the way, but which we might explore in the future if this remains a bottleneck. There was one academic paper I saw that discussed the choice between waking a process, when we send a message to it, on the last scheduler it was running on (which is what OTP currently does) or on our own scheduler, which helps with locks because it means we only touch other schedulers' run queues for work stealing, and not also for pushing work to them. They said they implemented this and that it was a win on realistic code, but they did not provide the code, and it's all dead now. So it would be good to actually implement it behind some flag and be able to test it.
Another idea would be, whenever we send work to another scheduler, to use some lock-free mailbox rather than directly taking a lock on its run queue and modifying it, with the same goal of reducing how often we touch another scheduler's run queue. Finally, one issue we have on very large machines is that communicating with the core next to us and with a core on the other side of the CPU does not have the same cost. We could try making the scheduler aware of that, so that, for example, whenever we steal work, we first try to steal from our neighbors, and only steal from someone on the other side of the chip if we can't find anything nearby.

So, the main takeaways from this talk. Whenever you see scheduler utilization being very high, and in particular being higher than CPU utilization, it means you have lock contention in the VM. Whenever you have that, you can use the lock counting tool very effectively to narrow down which specific kind of lock is the bottleneck, and which specific line touching that lock is the bottleneck. And then, hopefully, if it is in the scheduler, you now have some ideas about what kinds of bottlenecks there are and what can be done about them. And if you're running on an old version of OTP and seeing this kind of issue in the scheduler, please move to the latest version; hopefully they will be fixed. Thank you for your time. I'm now ready for questions.

>> Thank you, Robin, this was really great. [applause] It made me think a bit of Peer's talk earlier, where he talked about saving energy. When I saw your graphs at low usage, I was thinking: oh, that would be great for saving battery time on his boards, especially at low usage. But let's look at the audience here. Who has questions for Robin on the scheduler? I'll come over to you; I saw your hand.
>> So when you spawn a process, which scheduler does it get assigned to?

>> I think by default it is the same scheduler that is executing the current process. I've not checked, but I think so.

>> Hi there. I got curious about the second lock in the list you mentioned: the process table, was it? What was going on with that one?

>> Yeah. So what was happening is that every few seconds we have some metrics code which, just for logging, looks at the length of the message queue of each process and logs it. The way it does that is by iterating through every process. In OTP 27, the only way to iterate through every process was by taking a lock on the entire process table, and when you have more than 1 million processes, that takes a while. Now, in OTP 28, there is another way to iterate through all processes without taking the lock. It's not atomic, but that's fine for this usage. You're welcome.

>> Last one.

>> Thank you for the talk. So basically, if we have a lot of machines, then we get into the situation where some schedulers are out of work, right?

>> So it's not lots of machines; it's whenever you have a machine with many cores. Unless you're very lucky, there will be more work on some of the cores than on others. This work-stealing mechanism exists to make sure that all cores can find work to do. But sometimes it is itself the bottleneck, and then the schedulers spend all of their time trying to find work to do, wasting their time.

>> Thank you so much, Robin.

>> Thank you. [applause] [music]
Video description
✨ This talk was recorded at Code BEAM Europe in November 2025. If you're curious about our upcoming event, check https://codebebeameurope.com ✨ --- The BEAM is famous for supporting many lightweight processes. But the scheduler responsible for deciding which of these processes should execute on which core at which time was designed at a time when most servers only had a few cores, and can suffer from severe lock contention issues when used on many-core machines (100+ cores). In this talk I'll explain the main ideas behind how the BEAM's scheduler works, how to discover and investigate lock-contention issues, and then explain some recent changes that mitigated these issues for the BEAM's scheduler. --- Let's keep in touch! Follow us on: 💥 Bluesky: / codebeam.bsky.social 💥 Twitter: / codebeamio 💥 LinkedIn: / code-sync