Analysis Summary

Worth Noting (positive elements)
- This video provides a rare, detailed look at the internal project management structures and "follow-the-sun" on-call rotations of a successful remote-first company.

Be Aware (cautionary elements)
- The "cloud exit" narrative is a core part of their marketing; ensure your own team's infrastructure needs match theirs before adopting their anti-cloud stance.
Transcript
But I'm on it. Perfect. Okay, so I see people are slowly leaving Crowdcast and coming over to Zoom. We'll wait for the number of participants to grow just a little bit. Sorry about that, you guys. We clearly need an ops team on the Crowdcast side to help us through these technical challenges. We couldn't go live for some reason, so, quick pivot to Zoom. Thank you to those who are with us. There is a chat, so you can still chat here, and you can still post questions in the Q&A section. We'll give it just a few minutes while I close out of my Crowdcast window. Andre says maybe Crowdcast is on Heroku. I don't know what that means; it's an ops joke. Aaron gets it. We don't throw shade on other people, though. No Heroku shade.

Well, thanks for being patient with us, and sorry to start late. We're usually very prompt, but we ran into some technical issues this morning, so we switched over to Zoom. I hear chat is disabled; let me see if I can figure that out. While we're getting started, I'll do a quick intro, and then we'll look at the chat on the back end. I'm Kimberly from the 37signals product team, joined by Ashley from customer support. We're also here with Aaron from the ops team. I'll let him give a longer intro, because we're talking all about operations today: how we use Basecamp internally to run our ops team, keep our sites up and running, and not run into technical issues like we had this morning. We'll show you some real projects that we use in Basecamp, and we'll also leave some time for questions. I know a lot of you are probably interested in some of the technical details of 37signals and how we moved out of the cloud. Aaron's team did all of that work, so if you have questions about that, we'll leave some time at the end to go over it as well.
This session is being recorded on Zoom, and we'll post it and send an email out to you. Ashley will do some follow-up after the session. We'll be together about 45 minutes, probably till the top of the hour since we started a little late. Oh, it looks like my Q&A popup is working, so if you have questions, pop those in the Q&A section. Ashley, I'm going to turn it over to you.

Okay. If anyone has additional questions afterwards, you'll get that email from me, but if you decide, "Actually, I have a lot of questions about how my Basecamp account is set up now that I've seen yours," I love handling those. We'll jump on a Zoom call (there will be no technical difficulties), or we'll handle it over email. So send us an email at support@basecamp.com, and maybe reference this session or reply to the email that I'll send you. And if all of this feels like too much and is beyond what you're currently used to, sign up for our Basecamp 101 class. It's a pleasant walkthrough, with a Q&A, and you can get to that page by going to basecamp.com/classes.

Perfect. Aaron, let's go to you. Why don't you tell us a little bit about your role at 37signals and how long you've been with the company, give us the lowdown, and then we'll dive right into some of your Basecamp work.

Sure. How's it going, everybody? My name is Aaron Nicholson. I've been with Basecamp for almost 14 years now, and I've been the director of operations for four years. That role basically means that I manage the three regional teams who make up our operations group, and the operations group handles literally everything that goes on within 37signals. We manage bare-metal servers.
We manage all the software that we run to make our own internal software deliverable to all of you, and I think we do it pretty well. I'm excited to talk about the particulars of how we use Basecamp to manage the team and our workflows.

Perfect. Let's start, Aaron, with what our team looks like on the operations side. Talk us through the different parts of the world your team is in and how you handle 24-hour operations.

Yeah, for sure. We have the three regional teams that I mentioned. We have Team Americas, although I'm the only one of us who is actually in America. The three of us in the Americas are Matt, who is on Vancouver Island; myself, in Chapel Hill, North Carolina; and Victor, who's in Brazil. So we handle things during roughly Americas time zones. We also have our Pacific team: Paul in Singapore, Aaron in the Philippines, and Alex in Japan. Oh, there's our nice graphic. Love that. Bring it up. Thanks, Ashley. So once we sign off, we pass the baton, so to speak, across the Pacific over to Team Pacific, and they take it from there. And then from there, we have five employees on Team Europe, in France, the UK, Germany, Spain, and Italy, and they take it from there. This is a system that has taken us years to put into place, to be able to pass the baton across the globe. It basically lets us do what we call follow-the-sun on-call, where during the five business days, even if you're the primary weekly on-call on the team, you're only really responding to alerts and incidents within your own working hours. The idea is that hopefully you can get a good night's sleep.
You can have a little bit of time away during the week, and the people spread across the globe on our fully remote team will manage things for you. Then you'll take it for the weekend, when there aren't people working. An extra challenge during this time of year is that we have summer hours in place right now, where most of the company takes Fridays off. In ops we split it between Mondays and Fridays so that we always have somebody around during each of the five business days. That makes covering on-call and building the on-call schedule a little more complicated, but it's definitely worth it. It has provided, I'd say, the biggest quality-of-life improvement for the team since we got started.

Amazing. And then, tell us a little bit, Aaron... oh, sorry, you guys, I'm popping around different windows; not my normal setup here. Aaron, tell us a little bit about how long you've been at 37signals. I don't know that you said that.

Oh, I did. Yeah, 14 years. It's okay. So many. It's unbelievable that it's been that long, but yeah. And there are lots of people in the company and on the team who have been with the company for a very long time, and we're proud of that. One thing I'll say is that, at least compared to some of the other groups in the company, it takes quite a while to get on board as a new person in ops. It can take at least 6 to 12 months before you're fully up to speed and at the comfort level where you would be thrown into the on-call rotation. So once we get people here, we like to hang on to them forever. We don't want anybody going anywhere.

Okay, thanks for that. I did figure out how you can chat, so if y'all want to pop in there, you should be able to chat now. I think we figured that out on the back end.
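The follow-the-sun hand-off Aaron describes boils down to mapping the current time to whichever regional team is in its working hours. A minimal sketch of that idea (the exact hand-off hours are assumptions for illustration; the session doesn't state them):

```python
from datetime import datetime, timezone

# Hypothetical hand-off boundaries in UTC hours; the real schedule
# isn't specified in the session, so these are illustrative only.
SHIFTS = [
    (0, 8, "Team Pacific"),     # Singapore / Philippines / Japan daytime
    (8, 16, "Team Europe"),     # France / UK / Germany / Spain / Italy daytime
    (16, 24, "Team Americas"),  # Vancouver Island / North Carolina / Brazil daytime
]

def on_call_team(now_utc: datetime) -> str:
    """Return which regional team currently holds the on-call baton."""
    hour = now_utc.hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError("hour out of range")
```

The weekly primary on-call still exists, but alerts outside their working hours route to whichever team this lookup selects, which is what lets the primary sleep through the night.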
Okay, with that, Aaron, why don't you go ahead and tell us a little bit about how you use Basecamp internally to do your work. If you follow the 37signals philosophy at all, you know we work in six-week cycles. One of the questions people sent in when they registered was: how do you do that for ops? Do you work in six-week cycles?

We do, kind of. I'll say we do our best to work in six-week cycles. We try to scope our work so that it fits within the cycle paradigm, and as the director, one of my responsibilities is sending our kickoffs and our heartbeats every six weeks, at the beginnings and ends of cycles, to wrap up what we've done and what we're planning to do. So: we do, with a big comma. There's obviously some work in ops that takes five minutes. We are a very reactive team compared to a lot of the rest of the company. Especially if you're the on-call person for the week, your work probably doesn't fit into a six-week timeline; you're focused on reactive incidents and problems that come up within that week, so you have to take time out from your cycle work to be the on-call person. The other side of it is that we have many projects that take years to finish. The cloud move took probably a solid 18 months to move the app and database workloads, and we're still working on getting out of S3, which is something I can talk about a little later. So we have a lot of very long-form stuff that doesn't fit into a six-week time frame, but we at least provide updates, and we try to scope what we can get done within those big, long-running projects within a cycle.

Amazing. Why don't you go ahead and share your screen with us?
I always like to give people a peek into our actual Basecamp accounts and how we use Basecamp internally, and I know there are a couple of projects that you want to walk us through. So when you're ready, we can walk through that.

There we go. Everybody should have that now. Perfect. All right. So this is my Basecamp home screen, and I use drawers, as I call them, or stacks, as other people call them, to organize stuff. As with most of the company, we've got these HQ projects that everybody's in, which deal with company business. Within ops, this is kind of where all of our work happens. I've got an internal ops project that I'm not going to show you, and I won't show the rest of the company, because it's our own little private space that nobody else gets into. And we have a leadership area that is just for myself and the three team leads. But the whole company's on the ops project, and anybody who wants to be on the on-call project can be there. That's where I'll spend most of my focus here.

The ops on-call project is where we handle the reactive side of our ops work, as it says up top there. We do a lot of it within columns, so the card table/columns section is where the company will drop cards for anything they need from us, but it's also where we as an ops team record stuff that we need to work on. All of this is supposed to be on the small, reactive-bits-of-work side of things. If we have big projects or big cycle projects, we like to keep them in a separate project, or at least in a separate card elsewhere, which I'll show in a minute. But as you can see, there's kind of a lot. We have a backlog that we try to keep fairly reasonable, but sometimes it gets out of hand.
And then, yeah, we're either figuring it out or we're actively working on it. You can see we've finished about 3,670 cards within this project, so we've used this for quite a while. We used to use to-dos to do this, but now we use cards, and it works a lot better.

Nice. That actually answers someone's question: how do you manage IT tickets within Basecamp, or do you use a dedicated ticketing system? Basecamp all the way, from to-dos to card tables, which were previously called columns, but we're old-school relics.

100%. Yeah, we use Basecamp for as much as we can. Which isn't to say we never use anything else: in the ops team we interact with a lot of vendors, so we end up using a lot of vendors' ticketing systems. Primarily our data center folks; we interact with them a lot via their ticketing system. One thing I'll say is that we don't have anyone located at our data centers. We get that question a lot. All of our servers are located either in Illinois or in Virginia, plus a couple in Amsterdam, but people who work for our data center companies are there on site in case we need hardware maintenance taken care of, things like that. We're not there. I've only ever been to one of them; I've never been to the other ones. In 14 years. Right.

So, going back here, the primary ops project is where we do the more long-form work. I talked about cycle planning; this is where we capture cards for the work we're trying to do in the current cycle, the next cycle hopefully, and then in future cycles.
This is kind of the long-form backlog that we have. Some of this stuff used to live in the on-call card table backlog, and it got kind of unwieldy, so we moved it here. If you've got a pitch for something you'd like ops to work on in the future, we try to put it here and track it here. But most of the time we don't actually track the work within the in-progress cards themselves; they're mostly just placeholders that point to other projects where we do the big, long-form work. If we go back home...

Go ahead; I have a question about that one real quick. And I think somebody else also asked: if these projects are so long, can't you just divide them up into more bite-sized, six-week pieces? Does that end up working, often? Say you know that something's going to take maybe three years: when you're looking at how to break it into smaller, approachable things, especially by team, does six weeks make the most sense, or are you using something else? Or is six weeks a natural stopper regardless of how long the project takes?

Sure. I think it does work for us. I think it sometimes works better for software teams than for ops teams like us, but I still think six weeks is a good chunk of time: enough to get a significant amount of work done, but not so much that you can just stretch on forever. So yes, I think it does work well, and I think we generally do a pretty good job of scoping, even for a very large project, what we're going to try to get done within those six weeks. But the one thing I'd say about working in cycles in ops is that sometimes you just get waylaid, right?
Like last year, we were hit by a pretty big DDoS attack and had to put the brakes on literally everything we were going to do for that whole cycle and just do reactive work for the next month or two. So all of our planning carries a heaping amount of slack to account for unexpected things that happen.

Perfect. Aaron, as you go into other projects, I do want to get to Adam's question, which is how we use the message board versus docs, and what the difference is. I think the project you were just on has both a message board and docs, if I'm not mistaken.

It does. The message board is really for announcements within the ops project, and they're generally announcements to the rest of the company from ops; we announce big milestones here, for the most part. We also sometimes use the message board in the office project for team info messages, things like that. Whereas docs and files we don't use as much as some other areas of the company do, but we generally use them for capturing information: planning documents and things like that.

One thing I will note is that the chat over here in the ops project is kind of the lifeblood of the ops team in some ways. That's where we live all day: within the campfire that's in Basecamp. And that's a good segue to talk about alerting and how we interact with it, which I know some people had questions on. The chat within the ops project is intended to be the home for the team; that's where we talk to each other for the most part. But if there are super-critical alerts, alerts important enough to have gone to someone's phone, then they also go into this ops chat to get everyone's attention.
But we have several other levels of alerting chats that I can go into, and one of them is back in the project I was in earlier: the on-call chat. This chat is where we put alerts from our monitoring systems. You can see Prometheus here, which I can talk about in a minute, but this is for alerts that are important but not critical. We only get alerts for production and beta systems in here. We have extensive staging environments and the like that we use for a lot of testing, but they don't send alerts here; they send alerts to another chat, just to keep the clutter and the noise down. So if you're on-call, you're supposed to be keeping an eye on this chat and trying to stay on top of anything that happens before it can escalate into a problem that will affect customers.

Okay. Another question for you, specifically about Basecamp. Someone in the audience asked: can you show us one of the in-progress card details and how you reference other projects? That didn't seem clear; I don't know if there's a place where you can easily link to another Basecamp project or another file or document. We use this all the time internally.

Yeah, in cycle planning. Here's a good one. These are just placeholders for a bunch of things we have in progress. One of them: we run Ubuntu Linux on all of our servers, and we have a big project to upgrade to a new version of Ubuntu with a new kernel, which unlocks some performance for us. We're also doing some hardware upgrades. So this card has some high-level information, but for the most part it's just a link to this ops host upgrades project. And you can see that in the host upgrades project, we use the card table to track a bunch of the servers we're working on.
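The alert tiers Aaron just described (critical production alerts page a phone and land in the main ops chat, important-but-not-critical production/beta alerts go to the on-call chat, staging noise goes elsewhere) are exactly the kind of routing tree Prometheus Alertmanager expresses. A hypothetical sketch of that policy; the label names and receiver names are assumptions, not 37signals' actual configuration:

```yaml
# Illustrative Alertmanager routing only; receivers here have no
# notification configs attached, so this is a shape, not a working setup.
route:
  receiver: staging-chat              # default: low-noise chat for non-prod alerts
  routes:
    - match: { severity: critical, env: production }
      receiver: ops-chat-and-pager    # page someone AND post in the main ops chat
    - match_re:
        env: "production|beta"
      receiver: on-call-chat          # important but not critical; on-call watches this

receivers:
  - name: ops-chat-and-pager
  - name: on-call-chat
  - name: staging-chat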
So, for instance, these are ones we need to upgrade, but they're really hard, so they're in the backlog for now. These are ones that are not so hard and are good candidates if somebody wants to go grab one and upgrade it. And then we use Move the Needle to track how we're doing. We like Move the Needle quite a bit.

Yeah, that answered the question. Aaron, you also mentioned that Campfire, our chat tool, is the main way you all communicate with each other. I want to get to this question, which is a perfect follow-on: if Basecamp goes down, how do you track that? Meaning, if you're using Basecamp to track tickets and Basecamp stops working, how do you deal with that?

Yep, that's a big part of our job. Hopefully Basecamp works all the time. But a big part of our job is planning for unusual events that might happen and deciding what we'll do in that case. We've had several goes at this over the years. We have an old product called Campfire that we used as our backup comms for a period of time, but that's still on our own infrastructure, so it's not perfect. For a while we were using incident management software called incident.io, which we liked quite a bit, but they ended up just not being the best match for our workflows, because we do everything in Basecamp and it wasn't a super-tight integration. When we were with them, we were using Slack, because that's what they're made to work with. But now we have a ONCE Campfire instance. For those who aren't familiar, ONCE is sort of our take on shrink-wrapped software: you can buy a Campfire chat from once.com that's an all-in-one Slack replacement that you just pay for once.
So essentially we have a Campfire instance from that, running completely separate from our own infrastructure. I think it's running on DigitalOcean, just on a VPS somewhere that we set up, so that if everything goes wrong with our infrastructure, we still have somewhere to get together. The emergency plan, which we have in our documentation, is to chat with each other there and to get on Zoom for voice communications.

Perfect. Okay, I'm keeping some questions aside for the end, but I think those are our Basecamp-specific questions for now. Aaron, are there any other projects or anything else you want us to see before we jump into questions?

No, that's pretty good. I mean, we do have the teams folder here; each of the teams has its own individual project where they handle team-specific communications, which is frequently around things like on-call coverage and vacations. But they also chat with each other within their own time zones on their team pages. Then I have some management things I'm on that are separate, just little private projects to discuss things. And these down here are the four most important things for me right now. Obviously, the S3-to-Pure migration is number one on my priority list. We're working on moving five petabytes of data off of S3, and we've actually already copied it out. It happened last week, and it went super fast; I'm pretty proud of it. We were copying out of S3 at 80 gigabits per second for a little more than a week to get everything copied over to our Pure Storage systems. And that's done, so that's very cool.
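The numbers Aaron quotes check out on the back of an envelope: 5 PB at a sustained 80 Gbit/s is just under six days of pure transfer time, which lines up with "a little more than a week" once you allow for overhead and less-than-perfect sustained throughput. (This assumes decimal units, 1 PB = 10^15 bytes.)

```python
# Sanity-check the quoted S3 egress: 5 PB copied at 80 Gbit/s sustained.
data_bits = 5 * 10**15 * 8        # 5 petabytes expressed in bits
rate_bits_per_s = 80 * 10**9      # 80 gigabits per second
seconds = data_bits / rate_bits_per_s
days = seconds / 86_400           # seconds per day
print(round(days, 1))             # prints 5.8
```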
This other card we're hoping to get to, which will be a big one, is our AWS account closure project. Getting off S3 is the big headline thing we still need to do to close down the account, but AWS is a bit like a junk drawer: when you have it, you just kind of throw stuff in there, and there's a lot of stuff we still have that we need to migrate off, so this is where we're tracking that.

Then one question people have asked, and I saw it among the submitted questions, was how we manage our infrastructure and servers. We use a project called Cinc, which is an open-source fork of Chef, to do that. If a new server shows up, we do a pretty quick bootstrapping process to get it online and on the network and all that, and then we have some purpose-built Cinc recipes that will send it off to become whatever kind of server we need it to be. This is our project for building all the Cinc infrastructure. The other one I have up here is Tailscale, which is a VPN product we're evaluating for internal use, and maybe for some infrastructure use for a future product of ours that we're releasing this summer.

Nice. I always like to look at people's home screens, because I think what we do internally here, which maybe not everyone does but is really effective, is breaking things down into small sections. Teams have their own projects; every initiative you're working on has its own specific place to go. You have All Parents and All Pets, so there are social places, too. We really segment a lot of our work into these different projects.

Yep, for sure. Okay, with that being said, I'm going to dive into some of these questions.
If you have a question, drop it in the Q&A section rather than the chat, and we'll make sure we get through some of these. I'll start from the top. Question from Eric: can you talk about the decision not to use Kubernetes? Did you look at things like Nomad or Swarm? And this is cloud-exit...

Yep, definitely cloud-exit adjacent. So, when we first used the cloud, with Google, we used GCP for a period of time, and then we moved to AWS, and kind of the best place to put our apps was Kubernetes. And we liked Kubernetes; it was certainly a convenient way to deal with some of our application problems. But I'd say there are two reasons why we decided not to go with it. Reason number one: there was a longer version of this question that asked about upgrades and cluster management, and that was certainly a concern I had. Kubernetes is one thing when you have a platform-as-a-service provider like AWS or Google managing it for you, but Kubernetes is a very large, unwieldy, and complex piece of software to try to run yourself, and I was frankly kind of afraid of doing cluster upgrades in production. All of the testing and rigor that would need to go into every version upgrade on the cluster is significant; when we ran it on AWS, for instance, we had staging standby clusters that we could upgrade to test things, and just doing that for yourself is a lot of work and a lot of overhead. So I was leery of that. The other thing was just that Kubernetes is built for very, very large applications, of which there are probably, I don't know, fewer than a hundred on the internet total that really meet the scale where you need to be running Kubernetes; for most software, Kubernetes is kind of overkill.
We used a lot of it just because it was there, but I wouldn't say we needed 90% of what it has to offer. And the way we've chosen to deploy things, using Kamal (which does use Docker, which is a big part of Kubernetes, so we are using containers with Kamal), is far simpler. It's much easier to introspect and deal with unknowns when you're trying to troubleshoot a production problem than trying to hold the whole of Kubernetes in your brain to figure out what's going wrong.

When we were looking at potentially running Kubernetes on-premise, on our own hardware, we did look at a bunch of different vendors. The one we looked at the hardest was one from SUSE Linux called Harvester. The thing that ultimately swayed us against it was that they have two ways of running it, and neither was very appealing to me. One way is that you run it completely open source, completely on your own, with no support whatsoever, and pay nothing for it. The other way is that you run it in a completely enterprise environment, and I don't have the number in front of me, but it was a very large number, like millions of dollars a year, to run it at the scale we would need. In reality, we needed something in the middle, right? We do a lot of open source stuff, and we're very good at it, we like to think, but it's still nice to have a vendor you can talk to. It was certainly not worth the huge enterprise costs they quoted us.

Okay, I'm going to jump to another question that mentions Kubernetes, since you're on this topic. This is from Jordan, who's here live. Thanks for being here.
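For context on what "far simpler" means in practice: Kamal drives a containerized deployment from a single YAML file rather than a cluster of controllers. A minimal sketch of such a file, with every name, host, and credential a placeholder rather than anything from 37signals' setup:

```yaml
# config/deploy.yml (illustrative; service name, image, and hosts are made up)
service: myapp
image: myorg/myapp

servers:
  web:
    - 192.168.0.1        # plain hosts running Docker; no cluster control plane

proxy:
  ssl: true
  host: app.example.com  # kamal-proxy terminates TLS and routes to the container

registry:
  username: myorg
  password:
    - KAMAL_REGISTRY_PASSWORD   # read from the environment, not committed
```

A `kamal deploy` then builds the image, pushes it, and swaps containers on those hosts, which is the whole orchestration story: easy to introspect because there is little machinery to introspect.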
One of the advantages of using a platform like Kubernetes or Heroku, etc., is that much of the security is taken care of. Do you have any advice for companies with smaller ops teams, or even dev teams who also have to wear an ops hat, about hardening a VPS that's being used for Kamal deployments? That feels like a missing piece of the Kamal puzzle.

Good question. Although I'd dispute a little bit that AWS or GCP or Heroku does it all for you. There are still a lot of ways you can get yourself into trouble security-wise running Kubernetes on GCP or on AWS; it's by no means a fully fledged, secure-out-of-the-box system. And this is one of the things that gives me pause: especially if you're not doing cluster upgrades frequently, you can be exposing yourself to security problems without really noticing. But specifically for Kamal: it's very application-specific, and it's very specific to which provider you're running on. I'd say, in general, if you're doing basic best-practice stuff, where you're only allowing ports 80 and 443 in through whatever firewall your provider gives you, and you're not exposing more of your application than you need to, then I think you'll probably be okay with Kamal. It's no more or less secure than most things on the web, as long as you're doing best-practice stuff like keeping your software up to date and firewalling off anything you don't need exposed. The final thing I'll say is that in response to the DDoS attack we had a year ago, which I mentioned earlier, we put all of our applications behind Cloudflare, and I definitely recommend that basically anyone with a web application look at Cloudflare or a competitor. They do a great job.
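The "only allow 80 and 443 in" advice maps to a few firewall commands on a typical VPS. A sketch assuming an Ubuntu host using `ufw` (other distros or providers will use different tooling; keeping SSH open on 22 is an assumption you'd adjust to your own access setup):

```shell
# Default-deny inbound, allow outbound, then open only what's needed.
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp     # SSH for deploys/administration; restrict by source IP if you can
ufw allow 80/tcp     # HTTP (redirects/ACME challenges)
ufw allow 443/tcp    # HTTPS, where the app is actually served
ufw --force enable
```

Pair this with unattended security updates and you've covered the "keep software up to date, firewall the rest" baseline Aaron describes.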
It's their business to protect you from attacks that are probably bigger than you can respond to. I'd say we're not super big in web terms, but we're fairly big, and there are certainly bigger players out there who could attack us if they wanted to. But that's Cloudflare's job; that's what they do, and they do it really, really well. So lean on them. And especially for really small teams or small applications, they're fairly low-cost or free, depending on how much traffic you're putting through them.

Okay, Aaron, I'm going to go to a couple of lighter questions before you have to go back to technical topics. Sure. How much have you saved in total after leaving the cloud? Do you know that number? It's a high number, but what is that number?

So the headline number is $10 million over five years. That's the number David posted. Initially we estimated we'd save about $7 million over five years, and then we upgraded it to $10 million. I think that's still accurate, and maybe still a lowball; I think we might save more than that. And that takes into account everything, including the S3 move that we haven't quite finished yet. The fun high-level quote we have on S3 is that we built this redundant Pure Storage all-flash system, very fast and very reliable, in both of our data centers, and the five-year cost of the Pure Storage system we're replacing S3 with is about the same as one year of S3. For our workload, that's the kind of savings you can get.

Nice. Okay, and then another light question before the technical stuff. How do you identify, evaluate, and decide which ops tasks are most important? Is it qualitative or quantitative? That's a "light" question. Basically, how do you prioritize? Yeah, it's hard.
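On that storage quote: the ratio alone pins down the savings, whatever the actual bill is. If five years of Pure Storage costs the same as one year of S3, the storage line item shrinks by about 80% over the five-year horizon. The dollar figure below is purely hypothetical; only the one-year-to-five-year ratio comes from the session:

```python
# Illustrative arithmetic only; the real dollar amounts weren't given.
s3_annual = 1_000_000               # hypothetical annual S3 spend
pure_five_year = s3_annual * 1      # quoted ratio: 5yr of Pure = 1yr of S3
s3_five_year = s3_annual * 5
savings_fraction = (s3_five_year - pure_five_year) / s3_five_year
print(savings_fraction)             # prints 0.8
```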
I would say that sometimes your priorities are set for you by stuff breaking. That's the reactive side of the job. We try to contain that to the on-call person as much as we can, but the reality is that some things are bigger than what one person can deal with, like the DoS attack or any number of systems issues we've had over the years. So sometimes it's dictated to you by stuff breaking that you need to respond to. The other piece is that sometimes the company sets high-level directions for you, and the cloud exit was a big example of that. We were given a mission and a mandate to get off the cloud, and that directed our work for a solid couple of years; we're right at the tail end of that. One of the things I've tried to impress upon the team is that some of them have only been here while we've been doing this cloud move, but in the 14 years I've been here, most of the time the work is dictated by the team members. We have very diligent, dedicated people on the ops team; they all have their purviews, their spheres of influence, and they all know ways they can make those things better. So we really try to leave it up to the team as much as we can and let them drive the work that interests them the most and the work they think will make the biggest difference in our environment. Love it. Okay, I'm going to jump to Andre's question, which is all about metrics. There are lots of metrics that can be collected. How do you decide what to collect and review, and how do you ensure you're alerted to changes in metrics that matter for capacity planning, availability, and security? Good question.
This is something that could have been, and has been for a while, up on my pinned list of projects: we're at the tail end of a very big project to move off of an older monitoring system we had called Nagios. Which I largely developed. Well, I didn't develop Nagios; I developed our usage of Nagios and implemented it at the company. It was kind of what I did for my first couple of years here. I was our monitoring person; that was my role on the team. But as we grew, it became less good a match for us, and Prometheus is a new and exciting tool that definitely didn't exist when I joined the company, so that's what we've been deploying. We're almost done; we've got just a handful of things left that are still monitored in Nagios, and we're getting pretty close. In terms of answering the question, I think we try to cast as big a net as possible on what metrics we collect, because you never really know when you're going to need some weird metric that's out there. Having as much data as possible is never a bad thing when you're trying to diagnose weird problems. So within Prometheus we run something called node exporter on all of our servers, which gathers all the server-specific metrics and stores them. We have specific exporters for the Rails apps, for all of our network infrastructure, all that kind of stuff. And we take in all of that we can and store it, though not forever: we aggregate the metrics as time goes on, so they become less precise, but we try to keep as much history as we can. In terms of alerting, I have this philosophy that alerting is kind of like tending a garden, in that you're never really done, you always have room to improve, things will grow and need pruning, and it's something you always have to take time to work on.
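On the collection side, the setup described, node exporter on every server plus per-app exporters, might be sketched in a Prometheus scrape config like this. Job names, hostnames, and the app exporter port are illustrative assumptions, not 37signals' actual configuration:

```yaml
# prometheus.yml (fragment) — illustrative scrape config sketch
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['app1.internal:9100', 'db1.internal:9100']  # node_exporter's default port
  - job_name: rails_apps
    metrics_path: /metrics
    static_configs:
      - targets: ['app1.internal:9394']  # e.g. a prometheus_exporter-style app endpoint
```

Retention and the "less precise over time" aggregation would be handled separately, for example with recording rules or a long-term storage backend.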
It's never a complete product, so to speak. For people who build alerting systems, there's a standard set of things you need to alert on, like response time and errors for your application, and disk space, CPU, and memory for your servers. Of course we alert on all of that. Then some of it tends to be things you've learned from incidents: something breaks and you don't know why, so you go figure it out and add specific monitoring and alerting for that problem. We have a lot of that as well, and it's ever-evolving; we're always getting better. It's kind of an unsatisfying answer because it's so company- and app-specific, but that's the reality: you need to tailor it to the stuff that's important for your team, your infrastructure, and your customers. Okay, that question was from Andre, and I want to go to Andre's second question, which is: has the ops team ever considered publishing your approach to monitoring, or your approach in general? He says: you're a big contributor to the Rails framework, and I'd love to see more on the ops side. I'm thinking something like a framework for ops info management and monitoring that a new Rails team could adopt. Basically, Eron, write a book. Yeah, that's what Andre wants. So, I will say we have a dev blog, and I see Ashley just posted one of our dev blog posts on Prometheus and metrics. I have pushed the team to publish more stuff. I know we have a lot of dev blog posts incubating now where we're going to talk more about it. We have a couple of great blog posts from Farah that she created throughout the cloud move that we put out there. And I have a few that I've been trying to write; I've made promises about writing one about follow-the-sun on-call.
I've written a few about networking throughout the years. So yeah, we could definitely do a better job of sharing; I'll agree with that, and I'm trying to get people to publish more dev blog stuff. I don't think a book's going to happen, but we'll see. Okay, I'm going to jump to a couple of Kamal-specific questions if that works for you. With Kamal, how do you bootstrap the servers with observability services like Loki? Am I saying that right? Yep. Loki, Alloy, or Grafana. They can't be accessories to a specific app, because they must remain on the server even if the app is deleted. Good question. I'll dispute a little bit of that, but the question starts with bootstrapping, and that's something we do with Chef. Let me step back one block: our servers are sliced up into virtual machines with KVM, which is a virtualization platform baked into Linux, sort of like VMware. So if you have a new bare-metal server, most likely you'll bootstrap it as a KVM server, which lets it run individual workload VMs. Then, if you have a new VM that you need for Kamal, within some of our Chef recipes for KVM you give it an IP address, you tell it how much memory, how many CPUs, and how much disk you need, and you run Chef to have that VM created. We create all of our VMs using something called cloud-init, which makes the process of booting up a new VM very fast and easy, and Farah has written some blog posts about that. We can get a new Kamal VM deployed in less than five minutes pretty easily; it's very fast. And the beauty of Kamal is that the actual VMs are very, very simple. All they need is a base Linux operating system and Docker. Everything else is self-contained within Kamal.
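A minimal cloud-init user-data file for that kind of Kamal-ready VM might look something like this. All values here are illustrative assumptions (hostname, user, key, and the Debian/Ubuntu `docker.io` package), not 37signals' actual recipes; the point is just how little the VM needs: a base OS, Docker, and SSH access.

```yaml
#cloud-config
# Illustrative sketch of a Kamal-target VM's user-data.
hostname: app2-chi
users:
  - name: deploy
    groups: [docker]               # lets Kamal drive Docker over SSH as this user
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example.internal
packages:
  - docker.io                      # distro Docker package; Docker's own repo also works
runcmd:
  - systemctl enable --now docker  # make sure the daemon is up on first boot
```

Everything app-specific then arrives later, inside containers, at deploy time.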
So then you add your new VM to your application's deploy configs and you deploy the app to it. We have some manual steps to add it to load balancers, and we have to refresh monitoring and logging and things like that, but for the most part, once you've created a new VM and deployed something to it with Kamal, it's just there. On the questions around accessories, monitoring, and logs: we have other infrastructure set up as receivers for logs and for metrics that gathers all of those, and those are also defined within Chef. We have individual Chef recipes for a logging host, a monitoring host, things like that, and those all dynamically pick up new app servers, databases, and job servers as we deploy them, and then you get monitoring and logs and all of that. We use Grafana for all of our graphing, metrics, dashboards, all of that. Right now we're using Elasticsearch for logs, but we are demoing Loki as a potential replacement for Elasticsearch as an infrastructure logging platform. That was a lot. Hopefully I covered all the questions there. I mean, it's a deep question. I'm going to go to something lighter. Ashley, you can answer this one too; it's a little bit about Shape Up. Danny says, "Thanks for answering my question. Implementing Shape Up works great for development teams and other projects, but we have difficulties making it work for ad hoc teams like customer support and ops. How do you balance knowing when to shape or just get things done?" Ashley, you want to take that one first? Yeah. What is the goal of it? And is it actually kind of fine on its own?
I feel like humans are just really good at making things more complicated. So on any team, if you can recognize when you're accidentally leaning towards "you know what, I'm just going to write a bunch of tasks out," and it feels like work but it's not the actual work: just get started. In the majority of cases, done is better than perfect, and I'd rather you start something than keep pushing it off while making tasks and subtasks. If you need something a little more structured, maybe give yourself one workday, one meeting, whatever makes the most sense if you're working with other people on your team, to shape, so that you have at least something in mind together, some agreement on what you're going to do. Then maybe take the next two or three days, or two or three sessions of some kind, however you break your work down, to actually get that stuff done, and then come back to it again. So probably a one-to-two or one-to-three kind of model in terms of shaping versus actual effort. Eron, anything to add to that? Yeah, I think that's really good. I would say we're similar in ops, in that there's the reactive side of the work that we have to make space for, and we try to keep it confined to when you're on call. And there are the unknown unknowns that rear their heads and take everyone's attention. But in terms of shaping actual project work, when things go the way you want them to: we don't have pitch and shaping meetings in the same way a lot of the development teams do. When I go and write my kickoff for the coming cycle, there's a fair amount of the work that I know is going to continue from the previous cycle.
And so I try to estimate what, from this big chunk of a project that's been going around for six months, we'd be able to do in the next six weeks, and I just kind of do that. The other piece of it: if we have what we call small-batch projects, or new projects, or smaller contained work, I'll just talk to the team leads, they'll talk to their teams, and people will propose things and try to decide what they want to work on. But yeah, I would definitely agree that it should be as light a touch as possible, given the unknowns and given the reactive work. Try to give a framework for what you want to get done within the six-week period, but don't be too hung up on it when it gets messed up. Okay, y'all, it's the top of the hour. Eron, do you have 10 more minutes so we can answer just a few more questions? Yeah, sure. Okay, let's knock out just a few more and go to this question from Lungi: how do you handle disaster events, like data center problems, network problems, servers? It's kind of a broad question. Yeah, it's a big one. Or, you know, streaming problems. We react in the moment. We have plans; we have disaster response documents within our ops docs repo. Most of it is really about staying calm and staying focused on what's actually broken. Well, trying to figure out what's actually broken; sometimes that's difficult. And some of it just takes experience. Everyone has their own sets of systems, and everyone's systems break in interesting and bespoke ways at times. So you gain institutional knowledge of what's going to break and how it's going to break, and then hopefully you fix stuff so it doesn't break that way again. Our infrastructure, for the most part, is designed such that no one thing can take down an app.
So we have the two data centers I talked about earlier, one in the Chicago area and one in Northern Virginia. For all of our flagship apps, we design and test such that each app can run in either data center solo. So if Ashburn or Chicago goes completely offline, we can run Basecamp 4 and HEY and most of the other apps from one data center at a time. Which is an easy thing to say and a very hard thing to actually do in practice. A lot of companies say they have disaster recovery plans, but there's a big gulf between having a plan and having two sites where you actively go out and turn one of them off every now and again. It took many years of my time working here to have that capability, and we're very proud of it. But yeah, stuff breaks, and we just try to deal with it and roll with the punches as we go. I would say, hopefully, for the most part, if a server goes down, or a piece of networking equipment fails, or a provider has an issue, it's first about mitigating the immediate impact to the customers and to the stability of the site, and then, once you have it stabilized, going in and fixing whatever actually broke. That can take all kinds of different forms based on what breaks. So yeah, you've got to have plans, and the alternate comms that we talked about earlier, and lots of back doors to get into stuff just in case the front door is closed. And just a side note on that one, because I'm on the side of talking to the customer, translating what Eron is saying, trying to help somebody who's going "what?": for that part, be as clear as you can.
Boiling it down: on the ops team, you're going to deal with a wide variety of experience levels, and for us in support to be able to communicate what happened, we need to know roughly what you're talking about, but we don't need the full details. So be as boiled-down as you can when you're doing that last step of communicating things out to the rest of your team, the rest of your company, and then other people like customers. Yeah, and we're not always the best at giving you all in customer support non-jargony things to communicate. I love asking questions. Yeah, no worries. One other thing I'll say is that we're very big on checklists, and we try to have a lot of those within our emergency preparedness docs. I'm a private pilot, and you always forget things; having a checklist in front of you when things go wrong is maybe the most important thing you can do, because you can go in advance and write out the most important things. We humans are very fallible creatures; we forget stuff and we skip stuff. So it's always good to have a list of high-level things to go through in case of a disaster: I do this, then I do this, then I do this. More often than not, if you follow the list, things will go much better than if you don't. Amazing. I'm going to get to this quick question, which you answered at the very beginning, but let's get to it: how many people are in the ops department? Are they fully dedicated to ops, or are they also doing other projects? There are 11 of us, myself included, and we are fully ops all the time. The only caveat is that we all kind of do different things.
We're big believers in the T-shaped engineer, where we have a very broad but not-so-deep level of knowledge across all of our systems, and then each of us has our specialties on the team. We actually have what we call a responsibilities matrix, where for each high-level area, like networking, monitoring, security, databases, all the big topics of the team, we have subject matter experts in each of our regions to respond to each kind of issue. So yeah, everybody does ops, but everybody does ops in their own unique way. Okay, I really like this question; this is why I'm pulling it. What does a typical workday look like for an ops person? Checking dashboards and looking at numbers? Investigating health issues? Doing a backup restore test? Would love to hear some real-world scenarios that happened recently. Like, what's a day in the life? Interesting. That's a good question. I thought so. I would say the one thing it depends on is whether you're the on-call person that week, and to a lesser degree, whether you're doing regional on-call coverage for someone else during their on-call week. If you are the weekly on-call, then a lot of your time might be spent looking at alerts, looking at graphs, making sure everything's okay, and looking at cards that have come into the triage area for ops, responding to issues from the rest of the company, things like that. Hopefully, if you're not on call, you get to spend long-form, heads-down, no-distractions time on whatever project you're working on that cycle. And those projects can be very varied, so it kind of depends on what you're working on. It might be, like the project I looked at earlier where you're doing host upgrades, coordinating all of the work that's needed so that we can take down a server and upgrade it.
And that can take hours, or longer for some of them if they're complicated workloads. The big project for a bunch of us has been this big S3 move, so a lot of that was putting in place the infrastructure needed to copy all of this data out of S3, and doing stuff like that. My personal workload, sometimes, and recently a bit more than usual, is that I do a lot of the vendor communications: I talk to our data center folks, I buy hardware, I deal with account managers for all of the outside groups we work with. And then the other piece that's maybe unique to our team is that because we're spread across the globe, very frequently when you sign on in the morning there's a lot to catch up on. At least the first hour or so of my day is spent just reading what the other regional teams have been up to since I signed off at 5:00 the previous day. There's a lot that goes on, which is amazing and great, but you have to read yourself in every morning on what's happened overnight and figure out if there's anything you need to respond to. Eron, can you see the list of questions? I can. There are a handful we haven't gotten to; I'm going to let you take a look through those and see if there are any where you go, "yes, that's one we should definitely make sure to answer." Yeah, some of these are in a language that seems foreign to me. I can try to grab these quickly. So, the one up top for me is: how do you handle restarts of non-responsive processes in database and application instances? For databases, hopefully that doesn't happen, and for the most part it doesn't. Knock on something. But we use a piece of software called Orchestrator.
It was originally developed by GitHub and then, I think, passed on to Percona. What it does, if you're familiar with AWS, is act a little like their RDS Aurora product: worst case, if your primary database fails, Orchestrator will go in and move the writing instance to be one of the other replicas. All of our databases have one writer and multiple replicas, but the replicas are standing by to take over anytime they need to, and Orchestrator is the software that manages that for us. As far as the application side, Kamal handles that. Docker and Kamal also handle things like a process starting to use too much memory and getting killed: Kamal and Docker will restart it. We've used lots of these technologies over the years to manage that, Bluepill and other stuff, but right now it's mostly Kamal. Hope that answers that. Yeah, Eron, if you want to just grab one or two more questions. If we didn't get to your question, Ashley's going to send you a follow-up email, and if you just reply and ask your question, we'll seek out the answer from the ops team and get back to you. But we don't want to keep you all day long, so, Eron, I'm going to let you pick one, maybe two. No, you know what, I think I can do all of these real quick. We're going to do it. I love it. Saving Ashley some work. I'm not going to read all of them, but I'll get the gist. So, Pablo asked, "How do we manage cron jobs with Kamal?" The answer is that we don't. We migrated them all to not be cron jobs and instead be individual job processes run by the apps themselves. What used to be a Linux cron job is now just something scheduled in the app that runs as a normal Ruby on Rails job. Thoughts on AI?
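That cron-to-in-app-job migration can be sketched in plain Ruby. This is an illustrative toy, not 37signals' actual code: the class names are made up, and in a real Rails app the scheduler role would be played by something like a recurring-job feature of your job backend rather than a hand-rolled loop.

```ruby
# Sketch: replacing a Linux cron entry with an in-app recurring job.
# All names are illustrative, not 37signals' actual code.

# The work that used to live in a cron-invoked script:
class CleanupJob
  def self.perform
    # ... delete expired records, prune old files, etc.
    :done
  end
end

# A tiny stand-in for what a job framework's recurring-task support provides:
class RecurringSchedule
  Entry = Struct.new(:job, :interval_seconds, :last_run)

  def initialize
    @entries = []
  end

  def every(seconds, job)
    @entries << Entry.new(job, seconds, Time.at(0))  # never run yet
  end

  # Run any job whose interval has elapsed; returns the jobs that ran.
  def tick(now = Time.now)
    due = @entries.select { |e| now - e.last_run >= e.interval_seconds }
    due.each do |e|
      e.job.perform
      e.last_run = now
    end
    due.map(&:job)
  end
end

schedule = RecurringSchedule.new
schedule.every(24 * 60 * 60, CleanupJob)  # roughly what "0 3 * * *" expressed
ran = schedule.tick
puts ran.inspect  # prints "[CleanupJob]" on the first tick
```

The payoff is that scheduling lives in the app's own code and deploys with it, instead of in a crontab on a particular host.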
I think it's interesting, and I think it's maybe more useful, or at least the uses I've seen for it are more valid, on the software development side than on the ops side. I haven't really seen a killer app for AI in ops yet, but I'm open to considering it if it comes around. Matt asked what hypervisor we ended up using. I talked about this a little earlier: it's KVM, the standard open-source Linux way of slicing up boxes into VMs. And how many physical hosts per colo? I don't have the number right in my head; I think each of our primary data centers has somewhere around 80 servers. They're all standard 1U or 2U Dell servers, and we add more as we need to. All right, next one: how has the employee time and cost involved in moving off the cloud been factored in? It hasn't, because moving to the cloud never really saved us much employee time. That's the very short answer; David has talked about it more at length. The cloud was by no means a magic bullet for us in terms of ops time. Managing the cloud itself, and managing all of the software and the upgrades that are forced on you by cloud providers, ended up taking just as much or more time than keeping the lights on within our own walls, I would say. So we didn't shrink the team at all when we went to the cloud, and we haven't grown it since we came home. And the last question, a good network question: do you end up using BGP with ISPs for redundancy across multiple carriers? We do. You might have seen that one of my cards earlier was around BGP. So yes, we have multiple uplinks, both with our data center provider and with another, completely separate internet service provider. This is in my wheelhouse; I worked at an ISP for my first job.
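A multi-homed setup like the one being described might be expressed along these lines in FRR syntax. The ASNs, addresses, and prefixes below are made up for illustration (they're documentation/reserved ranges), and real policies would be more involved:

```
! Illustrative dual-carrier BGP sketch (FRR syntax; all numbers made up).
router bgp 64512
 neighbor 203.0.113.1 remote-as 64600      ! data-center provider uplink
 neighbor 198.51.100.1 remote-as 64700     ! second, independent ISP
 address-family ipv4 unicast
  network 192.0.2.0/24                     ! our announced prefix
  neighbor 203.0.113.1 route-map PREFER-DC in
 exit-address-family
!
route-map PREFER-DC permit 10
 set local-preference 200                  ! prefer routes learned via the DC uplink
```

Tuning which routes go out via which carrier, for example with local preference or prepending, is exactly the kind of per-carrier route selection being discussed.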
So yeah, we have multiple redundant 10-gig links to our carriers, and we run BGP with all of them. The card I mentioned is that we want to improve the way we do BGP, to be able to select routes to different carriers a little more precisely. Ashley will attest that sometimes customers have trouble getting to us from various ISPs, and we'd like to be able to affect that and fix things for our customers when we can. That was rapid fire. Yeah. Thanks for staying late, those of you who are still here, and thanks for being patient with us at the beginning; we still have a lot of you, so I'm glad you made it over. Like we said, we're going to send a follow-up email, and if you have any follow-up questions, just reply to that email. Thanks again for being here. Eron, thank you for your time and for answering all these questions. Twenty-three questions answered, and we will talk to you all soon. Have a great rest of your day. Thanks, everybody.
Video description
In this Basecamp Office Hours session (recorded live on June 11, 2025), Kimberly (Product), Ashley (Support) and Eron (Operations) from the 37signals team walk through how the Operations team uses Basecamp to organize their work. They share a peek into a few Ops projects and show how they're organized, and Eron answers questions from the audience about his team and 37signals' cloud exit. For follow-up questions, reach out to guides@basecamp.com.
Timestamps:
01:25 – Meet the 37signals team
04:07 – What the 37signals Ops team looks like
07:45 – How the Ops team works with 6-week cycles
09:40 – How the Ops team uses Basecamp
15:32 – How Ops uses the Basecamp Message Board, Docs & Files and Chat
18:18 – Referencing and linking to other Basecamp projects
19:58 – Using ONCE Campfire as a backup chat tool
22:00 – Breaking work into projects (teams, management, social, short-term projects, etc.)
25:15 – The team's decision not to use Kubernetes
29:10 – Security considerations when not using Kubernetes
32:01 – How much has been saved by moving out of the cloud
33:15 – Deciding which tasks to prioritize in Ops
35:02 – How Ops decides which metrics to collect and review
38:21 – Where to learn more about the Ops team's processes
39:45 – Questions about Kamal
43:12 – Shape Up for non-development teams, like Ops and Customer Support
46:43 – Handling disaster events
51:10 – Number of staff in Ops and their roles
52:06 – Typical work day of the Ops team
55:08 – Handling restarts of non-responsive processes
57:00 – How we manage cron jobs
57:23 – Thoughts on AI
58:16 – Employee costs of moving out of the cloud
1:00:00 – Wrap-up