bouncer

Ruby on Rails · 1.3K views · 28 likes

Analysis Summary

20% Minimal Influence (scale: mild · moderate · severe)

“This is a highly transparent technical interview; be aware that the 'revelation' of Rails being easy is a common community narrative used to encourage adoption among developers frustrated with lower-level languages.”

Transparency: Transparent · Human Detected: 98%

Signals

The content is a long-form technical interview featuring authentic human interaction, personal career histories, and specific technical nuances that lack the formulaic structure of AI-generated scripts. The speech patterns, including non-native English phrasing and conversational fillers, are highly characteristic of genuine human dialogue.

Natural Speech Patterns The transcript contains natural filler words ('um', 'uh'), self-corrections, and non-standard grammatical structures typical of a non-native speaker ('one of the guy', 'what bring me at').
Personal Anecdotes The guest shares a specific, relatable story about almost quitting school due to C++ and being inspired by a classmate to try Ruby.
Interactive Dialogue The host and guest engage in a back-and-forth conversation with spontaneous follow-up questions about Rails versions and career motivations.

Worth Noting

Positive elements

  • This video provides rare, specific data points on managing a 3-million-line Rails monolith, including CPU hour costs and database scaling limits with AWS Aurora.

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed: March 13, 2026 at 16:07 UTC · Model: google/gemini-3-flash-preview-20251217 · Prompt Pack: bouncer_influence_analyzer 2026-03-08a · App Version: 0.1.0
Transcript

Welcome to On Rails, the podcast where we dig into the technical decisions behind building and maintaining production Ruby on Rails apps. I'm your host, Robby Russell. In this episode, I'm joined by Floron Boron, a longtime Rails engineer at Doctolib, home to one of the largest Rails monoliths in Europe. Doctolib runs on over 3 million lines of Rails code, with hundreds of engineers contributing daily. Their test suite runs more than 80,000 tests per commit, which takes 130-plus CPU hours. Floron shares how his team revisited Rails defaults to improve developer experience and cut infrastructure costs, like dropping one engine's test time from 7 minutes to just under one. We talk about what slows big test suites down, how to fix it, the hidden cost of using factories, why Packwerk didn't quite live up to the dream, how they route read traffic across Postgres replicas, and lessons from years of Rails upgrades in a fast-moving organization. Floron joins us from the north of France. All right, check for your belongings. All aboard! [Music] Floron, welcome to On Rails. >> Thanks for inviting me. >> So, the question I like to start with is: what keeps you on Rails? >> When I was at school for development and engineering, we did a lot of C and C++, and I basically hated it. I almost quit school because of it, to completely change my life and do something else. I think I was in my third year or something like that, and we had this big two-year project to complete our studies, the only project where we were free to choose the technology. One of the guys at my school was doing Ruby on Rails, and he was talking a lot about it, how amazing it is and so on, so with my group we decided to pick it and try. That's the first time we touched some Ruby and some Rails, and I was so impressed by how easy everything is. Everything I needed was already there.
Okay, it was either already in the standard library, or we had it in Active Support, or we had a gem for it. So the experience was amazing, and it was a revelation. I was like, that's what I need. I'm not smart enough for the C and C++ stuff, always rebuilding everything, but this is the kind of thing I like. That's what brought me in, and I never changed since. It's been almost 10 years now, and I have done nothing other than Ruby and Rails. >> Do you remember approximately what version of Rails that was, off the top of your head? >> Five, I think. Something like that, or four. Yeah, the transition between four and five. >> You know, for me, having been in the Ruby on Rails ecosystem for over 20 years now, it's always interesting to talk to people who were introduced to Rails at different points in its life cycle. You mentioned that there were all these gems available to do a lot of the things you wanted to accomplish, so there was this huge ecosystem that had already been around for a decade. Comparing that to your experience with the other languages you were learning in school, do you feel like Rails is what kept you interested in computer science, in many ways? >> Yeah, 100%. When I say I was on my way to quitting, it was the truth. It was clearly not for me. My grades were not good enough at school, and so on, because it didn't keep me interested, and I was like, okay, it's too hard for me. >> Were you thinking about software development at that point in terms of how you would use the technology? While you were learning computer science, were you thinking, I want to build web application type tools, or backend, or closer to the hardware? Was there something drawing you specifically toward a more web-centric area of development? >> Not at the beginning.
It was my internship that brought me to the web, because when you try to find an internship, at least in France, most of what you will find is basically in web. So that's how I started web development, and my final school project was in Ruby on Rails, for the web. That's where I discovered that this is an amazing platform: you can build so many things so easily, and you can push it to millions of people. The experience was amazing. So I kept pushing on that, and that's also where you have most of the jobs. >> That's true. I think being able to deploy an application to the internet and have anybody access it anywhere with their web browser is very different from shipping some physical product, hoping that people will buy it, and then your code might end up running in someone's device somewhere. It's interesting. I learned a lot about the organization you work for, Doctolib, and how it's, I think, one of the largest companies using Ruby on Rails in Europe. Is that correct? >> Yes, probably. At least in France. >> We had a brief conversation before this, but I know that Doctolib's CI suite currently runs 84,000 tests. >> Yes. >> Which consumes over 130 CPU hours on a full run. >> Exactly. >> I'm just trying to wrap my head around that a little bit. How did it get to that point, first of all? >> So, I joined eight years ago, and we had like 5,000 tests. It was running on Jenkins at that time, and it was working pretty well. When I joined, we were like 15 engineers, and that was the point where the company was growing a lot. We were hiring 10 to 15 engineers per month, so it was pretty fast. Quickly we had a massive number of engineers, and we started to write a lot of tests, so this number grew quickly, quickly, quickly. The time on Jenkins was not acceptable anymore, so we migrated to, I think it was Heroku CI at that time.
So we had the parallelization of Heroku CI with 15 workers, if I'm not wrong, which kept the CI time a bit lower. Then the suite kept growing, and we ended up in this situation where the infrastructure to run the CI is currently bigger than the production platform. >> In that context, where you were onboarding a lot of new developers, and you mentioned you started there around 5,000 tests and now there are 84,000 tests in your test suite: at the time you joined, how was the code-to-test ratio? Was it pretty consistent with what it is now, and the application has just grown that much more, or was there a considerable amount of time invested in writing a lot of tests that needed to already be there? When you joined, was there a lack of test coverage? >> If your question is, did we have a lack of coverage when I joined, it was not the case. It's really been linear growth, based on the capacity to ship more. It was already very much in the culture to write a lot of tests when I joined, and mostly end to end, and we can talk about that later. So that explains the number of tests and the big number we have now. >> You know, I think some people listening might think: well, if you have a bunch of automated CI running and it's taking that long, is that really a problem? Is that something you needed to address, or could focus on improving? >> The duration we have on a pull request, we have had to work on that a lot. You have basically several options on the table. I think it's like scaling a web application: either you throw money at it, or you throw manpower at it to try to make it faster, or at least more performant. And we do both. We throw a lot of money at it to have more parallelization; we just basically add more servers to process one commit. At some point we were launching like 350 servers for one commit. >> Oh my gosh.
That's basically one way of fixing it, and the other way is putting some people on it, trying to make it more performant. The third way is finding a way to launch fewer tests: basically, apply test selection based on the pull request, and select only the tests you have to run. That's another way to reduce it, so we do that too. That's how we have to scale it. But at some point the duration is a problem for the velocity of the team, and we have to find a way. >> Yeah, I can see that potentially causing a lot of bottlenecks, I would imagine, if it takes that long and a couple of tests break, or you have a couple of flaky tests. How many engineers, for context, do you have right now? Is it, I think, over 700 engineers now? >> We are something around 400 to 500 engineers working on the monolith, my apologies, not 700. >> And given that, in a scenario where people are pushing things, I could imagine a lot of branches getting stuck in a merge queue because of tests not passing. When someone's trying to finish a project or a task, that feedback cycle... I mean, they must not be able to run the whole test suite reliably in their own local development environment, can they? >> No, locally it's impossible. >> That's currently an issue that we have: we don't have a merge queue. The stability of our main branch is crucial, because if for any reason the main branch is red, basically everyone is kind of shut down. Currently, I think the pipeline takes 40 or 45 minutes. So if your branch is rebased on main and, unlucky you, main was red, or became red a few minutes after you launched your build, you come back one hour later and it's red; you have to rebase and relaunch, and you've lost one hour. So yeah, that's where test selection is really important to try to reduce the time.
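Doctolib's actual selection mechanism isn't public, but the core idea of mapping a pull request's diff to a subset of tests can be sketched in plain Ruby. This is only a minimal sketch, assuming the conventional Rails layout where `app/foo/bar.rb` is covered by `test/foo/bar_test.rb`; all file names are hypothetical.

```ruby
# Minimal sketch of diff-based test selection (hypothetical; not
# Doctolib's real implementation). Assumes the conventional Rails
# mapping from app/foo/bar.rb to test/foo/bar_test.rb.

# Map one changed file to the test files that should run for it.
def tests_for(changed_file)
  case changed_file
  when %r{\Atest/.*_test\.rb\z}
    [changed_file]                            # a changed test file runs itself
  when %r{\Aapp/(.*)\.rb\z}
    ["test/#{Regexp.last_match(1)}_test.rb"]
  else
    []                                        # unknown files: a real system would fall back to a full run
  end
end

# Collect the unique, ordered set of tests for a whole pull request diff.
def select_tests(changed_files)
  changed_files.flat_map { |f| tests_for(f) }.uniq.sort
end
```

For example, `select_tests(["app/models/booking.rb", "app/services/slots.rb"])` returns `["test/models/booking_test.rb", "test/services/slots_test.rb"]`; the guest's point is that dynamic references and cross-engine coupling make the real mapping much harder than this.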
So if you don't touch many things and you work on your own engine, you can expect to run only a subset of the tests. It should be pretty fast, though not that fast, because there are incompressible things in the workflow, like building the Docker image and such, but you can expect a faster time. >> How long ago did you start working on improving the performance and speed of your test suite? >> I have been at it for eight years. I think it has been the work of a full feature team for eight years. >> And so over the last eight years you're keeping an eye on these things as the team grows and the codebase grows. I know you mentioned some slowdown that came from the database in particular. One typical approach with a Ruby on Rails test suite is that the framework constantly resets your test database in between tests. What's wrong with that, potentially, or what doesn't work about it? >> Okay, so I was tasked with improving the developer experience, so this was not specifically CI-oriented.
We wanted to improve the developer experience locally. Like Shopify, we have split the monolith into what we call engines (on the Shopify side, I think they name them components): small boxes where people put their code, and when I say their code, it's their applicative code but also the tests that test it. We were like, okay, if we want to improve the developer experience, one way to do that is having people able to launch their tests locally, and grow out of this CI-driven development workflow we had, where we just do stuff locally, test it (I mean in development, on the UI), then push it and expect the CI to be green, and if it's not, take the failing tests and start from there. But by then you have already lost 45 minutes. The point was: if the engine is isolated enough, just launching the engine's tests should be enough to get the big picture of what will fail on CI. But even launching this subset of tests locally was pretty slow. When I say pretty slow: for something around 300 to 400 tests, it was more than six to seven minutes on an M4 MacBook Pro. So we wanted to improve that, and most of my teammates were like, yeah, but you know, it's just a bit slow, that's how it is. So I made this small video to back the project, where I took a vanilla app, created thousands of tests, and launched them locally with the same setup as ours, so Docker running a PG database, etc. And it was incredibly fast. It was a matter of seconds to launch thousands and thousands of tests. And I was like, okay, so that's not really the problem. And if you follow David (DHH) on Twitter, you see he did a lot of benchmarks with the Rails 8 test suite, launching thousands of tests, and it's fast. So why is it not fast on our side? So I did a couple of flame graphs and such to see what the bottleneck was in our Rails application, and I came to the conclusion of several things.
One of the main bottlenecks is the database, several things at the database level. We were resetting the database between each test. That's a common pattern, but it is very slow, and it's even slower in our application because we have multiple databases. For each test we don't reset only one database; we reset 10 databases, and every time we add a new database it becomes slower and slower. So we had to change that. It was accounting for, I would say, something like 30% of the test duration for unit tests (not for end-to-end tests, of course, but for unit tests it was a massive amount). The second conclusion was factories. Factories were pretty slow because they interact with the database, this database which runs in Docker on macOS, so not a perfect world; the database is basically slow by itself. And not all factories are well built. Some crunch a lot of Ruby, trigger events, create other objects, etc., so we have a cascade of a large number of objects just to create, for example, one account. When I measured it, we were spending something like 50% of the test time in factories. >> One of the things you mentioned is that it's a common pattern to reset your test database between tests. Do you have a good sense of why that's important, or is it just to avoid any weird things when you're trying to run a quick test? >> So, I would not advise anybody to stop resetting the database between tests. It's really important to avoid test data that leaks into other tests, and so to keep consistency in your tests and avoid flakiness, for example. But there are other ways of doing it than taking all the tables and truncating them. >> Yeah. >> That's what we were doing, basically, in a slightly more optimized way than a truncate on all the tables, but that was essentially it.
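The factory cascade described above, where creating "one" record transitively builds a whole object graph, can be illustrated with a toy dependency table. This is not Doctolib's code; the factory names and the dependency graph are invented for illustration.

```ruby
# Toy illustration (not Doctolib's code) of the factory-cascade problem:
# each factory transitively creates its associations, so "one" record
# can mean many database inserts. The graph below is invented.
FACTORY_DEPS = {
  account:      [:organization, :user, :settings],
  organization: [:address],
  user:         [:profile],
  settings:     [],
  address:      [],
  profile:      []
}.freeze

# Count how many rows a single create(:name) would insert.
def records_created(name)
  1 + FACTORY_DEPS.fetch(name, []).sum { |dep| records_created(dep) }
end
```

Here `records_created(:account)` returns 6: creating "one" account also inserts an organization, an address, a user, a profile, and a settings row. Each insert is a round trip to a Docker-hosted database, which is exactly the kind of hidden cost that adds up across 80,000 tests.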
>> You also mentioned having, say, 10 different databases that you were connecting to. In what context would that be happening? Is this a multi-tenant type of application, or are there just 10 different databases for different types of data that need to be very much distinct? >> Yeah, so it's for scalability reasons. I think we can get to that later. >> Okay. >> But that's how we scale, basically. >> To circle back: you talked about using some flame graph tools, and that led you down the path of looking at how the test suite was resetting the databases between every test. You identified factories; there were concerns that they weren't entirely efficient in some respects. Were there other things you noticed during that research process? >> So, factories, the resetting of the database, and then basically the bloat we have added over the years. The monolith is now 12 years old, so we have added a lot of small things here and there, because someone wanted to fix something, because they had a red build, or because we wanted to put in some safeguard, you know, to stop someone from falling into the same trap again. All those things are done with good intentions first, but without monitoring: you put it in, but you don't really know the global impact; all you know is that it fixes your problem. And all those things add up. It's 1 ms here, 2 ms here, 3 ms here, but at the end of the day, when you have 80,000 tests, it becomes a massive problem. So that was basically it, the three problems we had: the database resets, the factories, and the bloat we have added over the years. >> So you identified these different areas. Then, as an organization, how do you, or maybe your team, thinking about the developer experience, begin to prioritize finding solutions to improve that situation?
>> I presented my research, basically, and I said: look, that's what people get when they have a vanilla app, here is our current experience, and here are the things that differ between the two, so these are the three things we can work on. And they were pretty convinced that we had a case, so they allowed us to work on it a bit and make our point. So basically we had a couple of weeks to migrate one of our engines to a new testing setup that was fast enough, at least faster than what we had. To me, we don't usually do that: we don't usually rewrite things; we're more into small incremental improvements every time. But in this case, it was too much. We had too much bloat; if you touch one thing, everything falls apart. So we should probably restart. And I was pretty convinced that not that much of it was actually needed, and we could migrate a big part of it without having to reconstruct everything. So that's what we did: we created our own test classes based on the vanilla Rails test classes, and we put nothing more in them. We started with one engine. Of course, we had a lot of red tests at first, so we brought back the code needed to make them green, and we went back and forth like that: okay, this helper is missing, this helper is missing; we had this configuration before, do we need it or not? We even changed some tests a bit to make them pass. Thanks to that, we built these new classes on the engine, and we won. I think when we started, this engine was taking something around 7 minutes to run, and we got it down to 2 minutes. We added a bit of parallel testing to get the extra juice from the Mac M4, and we went below one minute. >> Oh wow. >> So then we shared that, and people were like, okay, we need that, and so we got the time to work on the full codebase integration. >> So you were able to identify one of your engines. How did you go about selecting one?
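A "nothing more in it" test base class of the kind described, built directly on the vanilla Rails test case with parallelism enabled, might look like the following sketch. This assumes a standard Rails app; the class name, file path, and settings are illustrative, not Doctolib's actual code.

```ruby
# test/fast_test_case.rb -- illustrative sketch only (assumes a standard
# Rails app; class and file names are not Doctolib's actual code).
require "test_helper"

class FastTestCase < ActiveSupport::TestCase
  # Nothing extra: no global hooks, no custom database cleaning.
  # Rails rolls each test back in a transaction by default.
  self.use_transactional_tests = true

  # Spread tests across CPU cores (e.g. the cores of an Apple M4).
  parallelize(workers: :number_of_processors)
end
```

Engine tests then inherit from `FastTestCase`, and missing helpers or configuration get pulled back in one by one, exactly the red-green loop described above.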
Was it one that you felt your team had a lot of exposure to? You mentioned you had a pretty reliable test suite already, so I'm trying to imagine you in an editor, looking at your existing test suite. Out of curiosity, what did you go from, and what did you move to? >> For the engine selection, it was a trivial choice. We just took one where we have code ownership, so it's easier to modify things when we don't rely on someone else's approval process. We also tried to take one that was not too difficult to migrate, without, you know, a lot of end-to-end tests or mobile tests. So we tried to pick an easy one, but not too easy, so it's representative: an average engine, a recent one without too much legacy. That's what we targeted, and there was one that was pretty natural to choose. >> And what test framework are you using there? >> We are using Minitest for the whole suite. We have just a small addition, minitest-spec-rails I think, to have a bit of spec-style syntax included in ActiveSupport::TestCase, but that's it. It's Minitest, and we haven't changed that. >> You know, in one of our previous conversations, you described this as kind of a hard reboot on your test architecture, leaning back into Rails defaults. Did you feel like there were a lot of defaults that had been changed over those years? You mentioned people with good intentions whose changes had global impacts that made things a little slower. What were some of the patterns you noticed? Were there configuration things where you realized, we don't actually even need this? >> Yeah. I think the two main things were: first, not using fixtures and using factories instead.
That was, I think, the default 10 years ago: you ran `rails new`, you just put factory_bot in it, and that was the way to go. I think Doctolib started like that. So that's one thing we really wanted to change: go back to fixtures. And the second thing is transactional testing. Instead of resetting our database with custom truncate code, we started to use the transactional tests that exist in vanilla Rails. The idea behind transactional tests is that when you start your test, Rails opens a transaction on your database; everything you do is done inside that transaction, and at the end of the test the transaction is rolled back, which is way faster than truncating all the tables. Those are basically the two things we have done to go back to Rails defaults. When I say done: the transactional tests are done; the fixtures part is ongoing, it's a bit more complicated, but we have started migrating some factories to fixtures. >> So the application started being developed a couple of years before you were introduced to Rails. Do you have a sense of why the Ruby on Rails community started to embrace patterns like using factories instead of fixtures? >> I guess... I think fixtures are one of those things where, when you are introduced to them, you don't like them. It's a YAML file you have to fill; it feels inflexible. You also feel like you start your test with a ton of data, some of which you need and some you don't. It doesn't feel like the right way. So yeah, I think they're not flexible enough, and that's why people prefer factories, because they're flexible. But that's also, I think, the drawback of factories. So it's kind of like: we don't like fixtures because they're inflexible, but that's their strength; and we like factories because they're flexible, but that's also their main drawback.
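In a real Rails app, the rollback behavior described above is just `self.use_transactional_tests = true`, which is the Rails default. As a toy illustration of the semantics only, not of Rails internals, the rollback-versus-truncate trade-off can be modeled with an in-memory store:

```ruby
# Toy in-memory model (not Rails internals) of why rolling back a
# transaction is a cheaper reset than truncating every table.
class ToyDatabase
  def initialize
    @tables = {}
  end

  def insert(table, row)
    (@tables[table] ||= []) << row
  end

  def count(table)
    @tables.fetch(table, []).size
  end

  # Truncation has to touch every table, even untouched ones, and
  # with 10 databases that work is multiplied by 10.
  def truncate_all!
    @tables.each_value(&:clear)
  end

  # A rollback simply restores the pre-test state when the block exits.
  def transaction
    snapshot = @tables.transform_values(&:dup)
    yield self
  ensure
    @tables = snapshot
  end
end
```

Seed data inserted before the block survives, while anything written inside the block disappears when the "transaction" ends, which is exactly the isolation the truncate approach was buying at a much higher cost.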
It's such an interesting thing. I'm trying to remember, it's been such a long time now since that began to permeate the community, but I think at the time it was a developer ergonomics thing. Working with factories felt a little more friendly to us as developers versus handcrafting and maintaining well-defined, strict YAML files. >> It's kind of like working in Python versus Ruby, where you've got to think about formatting and such. >> So we were like, oh, this is so much better, I can express myself. But then you fast forward several years, and all of a sudden that's the thing slowing down your developer experience for running your test suite, and causing other weird side effects. So do you think there's a world where there's something in between that would make the developer experience of writing tests a little nicer, so we're not thinking about fixtures in the same old way, but we get that fixtures are faster? Where is the balance? >> So, yes. In our case, we are still keeping both: we have factories plus some fixtures, so we are in the middle ground. If I had to start fresh, one of the things I would do, and it's what I do on my pet projects, is use fixtures, and Shopify has a tool named factory-fixture or fixture-factory, I never know which, and basically it's a tool you can add to your app where you can take a fixture and say: I want this fixture, but a bit different. So it's kind of the middle ground, where you have fixtures, and thanks to the fixtures you can test 80% of your app, and then you use this fixture-factory tool to create some objects to test edge cases, etc. It's just some helpers on top of fixtures, basically, but I kind of like this pattern. >> I'll definitely include links to that, once we track it down, in the show notes for everybody.
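Shopify's gem has its own API (which the guest himself only half-remembers), so the following is purely a hypothetical sketch of the underlying pattern: start from a named fixture and override just the attributes a test cares about. The fixture names and data are invented.

```ruby
# Hypothetical sketch of the "fixture, but a bit different" pattern
# (Shopify's gem has its own API; names and data here are invented).
FIXTURES = {
  bob:   { name: "Bob",   admin: false, locale: "fr" },
  alice: { name: "Alice", admin: false, locale: "en" }
}.freeze

# Start from a known fixture, overriding only what the test cares about.
def build_from_fixture(name, **overrides)
  FIXTURES.fetch(name).merge(overrides)
end
```

So `build_from_fixture(:bob, admin: true)` yields Bob's fixture data with `admin: true`: the bulk of tests lean on cheap shared fixtures, and only the edge cases pay for bespoke objects.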
Is that one of Shopify's gems, you think? >> Yes, yeah, it's a Shopify gem, factory-fixture I think. And when you use it, you declare a factory and you say, I want a user like Bob, but I want him to be an admin, for example, and you just put admin: true. >> And so you have something like Bob, your fixture Bob, but with admin. >> I would imagine that must add a little latency, or take a little more time to process. But if you use those sparingly, you might be able to keep your test suite running as fast as you can possibly make it, I suppose. So, going back to your story: as an organization, you identified an engine, you went through it, and you got an engine that took approximately seven minutes to run its test suite down to less than a minute. Then all of a sudden the rest of your team gets it, and you can get more buy-in to start tackling more engines, or your larger applications. Was your team responsible for rewriting tests in this new approach, or was that spread out to the rest of the organization? >> Very good question. We had some disagreement around that. I was pushing more to, you know, take the opportunity of a great reset to rebuild everything. Meaning that we build the framework and we put a deadline on teams to migrate to it. We make everything we can to make it easier for them to migrate, but we let them handle it. That way there's a bit of smartness in the loop: they migrate, they see stuff, they report to us, we fix, we improve the solution, and we can also say, no, this kind of thing we don't want to see anymore; there were a lot of patterns in the codebase that were no longer acceptable from a performance standpoint. Overall, that didn't convince people.
So we decided to go for another solution, where we migrate everything ourselves. We had to make some trade-offs in that, because we cannot rewrite every test for everybody, so we had to keep maximum compatibility with what we had in our new framework. I think we lost a bit of the performance gain, but overall we have now migrated, I think, 90% of the codebase to the new framework. In three months, I think that's an achievement, and we are pretty happy about it. >> This episode of On Rails is brought to you by Concerns, the lightweight supplement for bloated models and scattered logic. Are your controllers overworked? Models doing too much? You might be a candidate for Concerns. Just one include a day can help extract shared code across your app, whether or not that code actually belongs there. Concerns are modular, reusable, and questionably named. Side effects may include unclear ownership, callback confusion, and saying "we'll refactor this later" at least once a week. Ask your tech lead if Concerns are right for you. Concerns: because everything has to go somewhere. >> Can you tell us a little more about introducing the new test classes while you had these legacy test cases to support during the transition? Did you use any tools to try to automate much of that, or was it primarily copy-pasting between files? What did that look like? >> So it's one of my teammates who did it. A lot of it was just search and replace, basically, to change one class name to another, and CI. I think he split it by engine, so he was migrating one engine, then another, and so on. He also used a bit of AI, nothing too complex, to migrate it. We had basically a Confluence page explaining the breaking changes; the point was that if users wanted to migrate themselves, they could.
So we maintained this list: if you have this, you have to do this; if you have that, you need to include that. And I think he used a bit of AI, fed with this document, to migrate some tests. >> One of the things we didn't touch on: is Doctolib's platform primarily composed of a monolith with a bunch of engines, or are there a bunch of other external services, or is it somewhere in between, a hybrid situation? >> Ten years ago, we had this monolith. Over the last six or seven years, we migrated from this monolith to a monolith with engines inside it, and for the last two years we have also had some external services. So we have this big monolith, and 80% of the traffic still goes through the monolith, but we also have some new services. >> Are those also built with Ruby on Rails, or are you using other technologies and frameworks for those? >> No, they are in Java, most of them. We have a bit of Node, a bit of Elixir, some in Rust, but most of them are in Java. The strategy is to have the new external services in Java, and because of acquisitions and things like that, we also have some services in other languages. >> I see. And then another thing: I know that you folks are using Packwerk, I believe, to modularize parts of your codebase. >> Yes. >> Were you around when the decision was made to start using Packwerk? >> Yes, in fact it's my team that decided to do it. I was in this architecture team we had at the time, when we had this big monolithic codebase and we were like, okay, it cannot fit in one head. We have to find a way for a team to be able to work in these small boxes and only keep those boxes in their head; to have several small monoliths, if you like. The strategy was to use Rails engines to build these small boxes, and that's where we started. So, exactly like Shopify with their components.
I'm not sure they are using Rails engines to do it, but anyway, it's the same idea behind it. Soon after, they released Packwerk, and we were like, it's a no-brainer for us, because it's basically exactly what we are doing and what we want. The idea of having these small packages that have their own public API, so if someone wants to talk to one, they have to go through that public API, and this declaration of dependencies, and so on. It was a no-brainer for us. So we adopted it, and we are still using it, but, and I think that's what you want to discuss next, we have a small disillusion about it. >> Oh, interesting. You know, initially, as you were thinking as an organization about how to let teams focus on the area they're going to own, exposing a public API and such using something like Packwerk: are your teams already separated by different areas of your monolith, or do people jump in and out quite a bit between different areas of your platform?
>> So we have what we call domains, okay? A domain is responsible for a big chunk of the product, and inside a domain we have feature teams, and each feature team has a piece of that scope. Everything is really well cut inside the organization around that, and of course the codebase follows this pattern: every file is owned by a team, each engine is owned by a team. It's really well defined which part is owned by whom. It's not about who can work on it, everyone can contribute, I mean, but everyone has their own part of the codebase. >> Did adopting Packwerk help the developer experience in terms of things like local dev performance or test reliability, or was it more about letting people work in their domain and focus there? >> I don't think so. That's my take, it's personal, but one of the things was: okay, we want to modularize. We want to do it for the developer experience, but it's also a great way to improve test performance. As I said a bit earlier, we had this test selection process that tries, based on the diff, to launch just the tests we need. So if we decouple the application and the engines from the other parts, we should normally be able to launch just a subset of the tests. If you work on your engine, you should be able to launch just your engine's tests and be pretty confident. In real life, I don't think it has improved the situation much. >> Can you speak to that a little more? If one of your teams works on a set of features, and maybe they're responsible for an engine or two, and they're running the test suite for their area: what doesn't work about that in, air quoting, the real world? Are they a lot more tightly coupled to other areas? Tell us more. >> So it's a big topic, but I think getting to a zero-dependency package is an illusion.
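The diff-based test selection mentioned a moment ago can be sketched in a few lines. This is a hypothetical mapping that assumes a conventional app/ to spec/ layout, not Doctolib's actual tooling.

```ruby
# Hypothetical sketch of diff-based test selection: map each changed
# file to its conventional spec path (app/models/user.rb -> spec/models/user_spec.rb).
def tests_for_diff(changed_files)
  changed_files.filter_map do |path|
    case path
    when %r{\Aspec/.*_spec\.rb\z}
      path                                    # a changed test selects itself
    when %r{\Aapp/(.+)\.rb\z}
      "spec/#{Regexp.last_match(1)}_spec.rb"  # mirror the app/ path under spec/
    end
  end.uniq
end
```

Real systems go further, following constant references and engine boundaries rather than file paths alone, which is exactly where the coupling problems discussed next come in.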
We have not been able to do it, and if you want to do it in a big application after several years of work, and I'm not speaking about a greenfield app, you will have to invest a massive amount of time to reach it. And by doing so you will probably make the code worse in terms of readability: you will have to invert a lot of the dependencies, and because of that you will use events, which will make everything worse for debugging and traceability. So yeah, it's a complicated topic, but I don't think it's reachable to have zero dependencies. Also, Packwerk has its own limits. It's static analysis in Ruby. So even if you reach zero in Packwerk and you try to launch your CI with just one engine, for example, chances are that doesn't work, and you will have other stuff to fix. Over time you will have to do it again and again and again, because someone will introduce a dynamic reference and it doesn't work anymore. So it's really a lot of work and a big investment to reach that. It's a nice tool, we still use it, and I still think people should use it, but don't fall for the illusion that you will just work in your engine, launch the tests of your engine, and it will be amazing. >> You and your co-workers there have seen a lot of benefits from it, but it didn't necessarily deliver on that promise, or the illusion that you're going to be immune to a lot of the issues that you already have in your codebase. >> Yeah, exactly. That's exactly it. You won't be immune to that. Also, Packwerk tells you that there is an issue here, but it doesn't tell you how to fix it, and it's not always that easy to fix a dependency. That's what I was saying: sometimes we make it worse by trying to fix it. So yes, it's kind of sad, we are not immune. The tool is not perfect. It's just a tool. >> When you get to that number of engineers, I can only imagine. I've never worked in that type of space myself, so I have no concept of just how much is happening on a day-to-day basis.
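The dynamic references that defeat static analysis look something like this; the module names are invented for illustration.

```ruby
# A constant referenced through a runtime-built string leaves no literal
# constant in the source, so a static scan cannot attribute the dependency.
module Billing
  class Invoice
    def self.describe
      "billing invoice"
    end
  end
end

name  = ["Billing", "Invoice"].join("::")  # assembled at runtime
klass = Object.const_get(name)             # invisible to constant-literal analysis
```

Any call like this (const_get, constantize, reflection over configuration) silently reintroduces a dependency that the boundary checker cannot see, which is why the cleanup has to be redone "again and again."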
And to try to protect the codebase as much as you can, and protect the individual developers, that's a lot of competing things. You want to be able to move fast, and you don't want to slow down the velocity of your dev team, but you also don't want everybody just throwing code wherever. At that scale there's not someone there to make a decision on every single thing. So it's an interesting tension. And out of curiosity, are you also using any AI stuff now, doing code generation with AI and taking advantage of that? Is that helping at all? >> We have basically full access, and we can leverage it as much as we want. So we have access to, you know, Claude Code, we have access inside the IDE, we have access inside GitHub Actions, so we can automate a lot of stuff, and we are encouraged to. I think they're great tools. I don't use it much myself for code generation; I don't really like, you know, copilot, autocomplete, etc. I prefer to type the code myself. I use it more like a pair programmer, or when I want a second opinion on something, or to shape my thinking about a draft, that kind of stuff. But I think it's a really cool tool for onboarding into a new codebase. I know a lot of new joiners use it, you know, like: how do we do that in this codebase? Because of course at this size you try to stick to the defaults and stick to how Rails works, but there's some stuff where that's not enough and you want more, so you build your own, and so people need to know how you do that, so they use the tools a lot for that. Myself, I use it a lot to build some workflows. For example, we have feature switches inside the codebase, and we have an expiration date on feature switches. So basically, a team wants to build a feature, they create a feature switch, and there's an expiration date on the feature switch, so they put it at, like, three months, six months. Often the team forgets to do the cleanup. So they have the feature switch.
It's enabled in production, but the feature switch is still there in the codebase, and there is a branch of code that is useless now. So now I have this workflow that tries to clean them up: it takes the expired feature switches, opens the pull request, pings the team, and they just have to review and make a few adjustments if needed. For this kind of stuff I find it pretty useful. This kind of work was not that easy to automate before, because you need a bit of logic: it's not just a search and replace, you know, it's okay, if it's enabled I have to remove that branch and I have to refactor the file. So it's a bit more complex. Bringing in this bit of intelligence, if I can say that, now allows us to do this kind of stuff, and it's pretty interesting. >> I'm actually really curious about your feature switches. Is that your way of deprecating a set of features that you're going to remove at some point, or something that needs to turn on at a certain point, or is it, like you mentioned, kind of a feature flag? In what context, what pattern does Doctolib use with your teams? >> We use it for basically everything. So it's a complete system now, with percentage-based feature switches and so on. It's how we deal with removing a feature as much as introducing a new feature to clients through cohorts and such. It's also how we change a query, for example: sometimes we just put a feature switch on a query change because we are not sure the new query will be performant enough in production. >> So it's really a way to release any sort of code, or remove any sort of code, for the stability of the platform, basically. When does your team need to make that sort of call at that granular a level, where a team might have a hunch or suspicion that this query might be less performant once rolled out to production, so you want to test it out?
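The expiring feature switches described here can be modeled minimally like this. FeatureSwitch and cleanup_candidates are hypothetical names for illustration, not Doctolib's actual API.

```ruby
# Minimal model: every switch carries an expiration date, and expired
# switches become candidates for the automated cleanup workflow.
FeatureSwitch = Struct.new(:name, :enabled, :expires_at) do
  def expired?(now)
    now > expires_at
  end
end

def cleanup_candidates(switches, now)
  switches.select { |s| s.expired?(now) }.map(&:name)
end
```

The automated part that is hard without an LLM is not finding the expired switches (trivial, as above) but rewriting each call site: deleting the disabled branch and re-flattening the surrounding code.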
Does that kind of speak to not being able to run a test against production-scale data in a staging or QA environment, where your local development is never going to have that amount of data? >> Yeah, we don't have this amount of data in staging and test environments, so we cannot really test that there. Even if we had the data in staging, it's not the same, you know; the query plan won't be the same. It's really hard to replicate that in another environment. So we really incentivize people to use feature switches for any kind of stuff; we have made it really trivial to add them. If things go wrong, we prefer it to be behind a feature switch. Basically, yeah, we put almost everything behind one. >> In those situations where you've got these feature switches, and I'm just going to ask kind of a dumb question here, has that not introduced a bit of a performance implication itself, having to check the current status of these feature flags as you're executing the code? Does any of this get cached, you know, in the server that's running? >> Yeah. So the feature switches are stored in the database, so of course there is a database read for that, but we have been using this for so long that we have optimized it a lot. There is caching at the request level, there is caching at the worker level. Every time you check a feature switch, we don't retrieve them from the database. It's now almost transparent. >> Is there a pattern there? If one of your teams is going to ship out something, maybe this example of a query where you want to see how it performs in production, you ship it out. Do you tend to default it to off or on? Is it like, okay, ship it, or does it depend? And then the process is: it's shipped out to production, it's deployed, now someone's looking at some metrics that are coming in,
and you're like, "All right, I'm going to flip the switch and see what happens for the next 10 minutes," and then, "Oh, that didn't work out. Turn it off." Is that how that's working? >> We roll out, it's false by default, and then we have a UI to activate it, and that's basically what people use. As I said before, there are different types. There is the basic on/off boolean, but we also have a percentage feature switch where we can release it for a percentage of traffic, you know, 1%, 10%, etc. So depending on the criticality of what you're doing, you can use that. You cannot use it in all cases; there's no stickiness to it. People activate it, and then they monitor, and they can quickly roll back. We are also currently investigating auto-rollback on the feature switch, to have some automated process. We use Datadog as an APM, and we also have Sentry for error reporting. Actually, that's probably interesting: we had the idea to use Datadog with an alert to auto-rollback, but I also have in mind to try something else. There is a Datadog MCP server, and Sentry has one also, and I kind of want to try letting an LLM make the call on whether we should roll back or not. So basically, when a feature switch is flipped on or off, I have an event, and then I can trigger the LLM, which can monitor Sentry and Datadog for the next 10, 15 minutes and make the call on whether we should roll back or not. That's something I want to do. >> Sounds interesting. I've been following a little bit what Sentry is doing there in particular, because I follow one of them on social media, and I know he's been talking a lot about that. Curious to see how that pans out. I think I've seen people talking about self-healing things, like having the error reporting tools send you PRs with a potential fix for something, and I'm like, that's fascinating.
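A minimal sketch of the two switch types and the request-level caching just described: the boolean switch, and the percentage rollout with no stickiness (each check is an independent draw). Class and method names are hypothetical; a real implementation sits in front of a database-backed store with worker-level caching as well.

```ruby
# The store is consulted at most once per switch per "request" (the cache),
# and percentage switches draw independently on every check (no stickiness).
class SwitchChecker
  def initialize(store, rng: Random.new)
    @store = store   # anything responding to #fetch(name) => { enabled:, percentage: }
    @rng   = rng
    @cache = {}      # request-level cache: one store hit per switch name
  end

  def enabled?(name)
    config = @cache[name] ||= @store.fetch(name)
    return false unless config[:enabled]
    return true unless config[:percentage]   # plain on/off switch
    @rng.rand(100) < config[:percentage]     # independent draw, no stickiness
  end
end
```

Because the draw is per-check rather than per-user, a 10% switch exposes roughly 10% of requests, not a stable cohort of users, which matches the "no stickiness" caveat in the conversation.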
>> I think there is something like that. I haven't tried it, but I think there's something like that to investigate your stack trace. In most cases, I mean, a lot of the Sentry errors we had are trivial fixes, so I'm pretty sure it can be good at it. >> Yeah, I think that could be quite interesting. Let's talk a little bit about scaling your database. I know that Doctolib is, I think, using Aurora Postgres. Is that right? >> Exactly, yes. >> My understanding is that you've hit AWS's limits, and you're not the first team I've talked to recently specifically saying this: you're running on the largest instances, right? >> Yep. Yeah, we are running on the largest instances. We have 10 writers today, and each writer can have up to 15 readers, and we already have some writers that reach this reader limit. So we have to remove stuff from those writers and put it elsewhere. >> Why aren't they just making larger systems? I'm kind of being a little facetious there. But are you just storing too much data? Tell me more about what it's like at that scale, where you're hitting the ceiling in that way and you'll eventually have to spin up the 11th and 12th servers. What's the situation you're needing to navigate there? You mentioned maybe needing to move data somewhere else; tell us more. >> Yes, that's basically it. Currently the bottleneck for scaling, probably for almost everyone, is the database. When I joined, we had one writer and several readers, and we pushed that as far as we could. For the readers, you know, you can have readers, but you need to send queries to them. So we added some manual code inside the codebase to send queries to the readers, and we pushed that where we could, and at some point the writer had too much load on it. So we needed to figure things out. When we looked into the writer, one of the issues was that the writer had so many reads going to it.
So we needed to find a way to offload those reads to the readers. So we worked on a solution to automatically send reads to the readers by analyzing the query. We parse the query. We look if it's a SELECT. If it's a SELECT, we can probably send it to a reader. Then we look if we have writes, in the same request, to the tables the SELECT is looking at. If not, we send it to the reader. If yes, we keep it on the writer, because there is replication lag between the writer and the reader; to avoid stale data, we keep it there. So we have done that, and we have improved it over the years to offload the maximum of the reads onto the readers. And it was not enough: at some point one writer was not enough, and not because of the amount of data. I think on the main writer there's something like 40 terabytes; that's not really the issue. The issue is more the number of operations we do on it every second, the I/O. So we needed a second writer. So we put in the second writer. But once you have it, you need to migrate data from your main writer to the second one. So you need to select which tables to move. You have to do it with care, because you won't move just one table; you will move a group of tables that are very related to each other, to keep the joins and the foreign keys, etc. And to do it safely, you have to first ensure that there are no joins from the tables you won't move to the tables you will move, and you need to build tooling around that to be able to do it safely. Also, because you know that's not the last time you will do it, you need to automate it. And that's how we went from one to two, then two to three, and now we are at 10. >> It's interesting. I don't have it in front of me, but I think I saw someone at Intercom post something the other day on social media.
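The routing heuristic just described, where SELECTs go to a replica unless the same request already wrote to a table the SELECT touches, can be sketched with a toy SQL scan. A production implementation would use a real SQL parser; the regexes and class name here are illustrative only.

```ruby
require "set"

# Toy per-request router: track tables written so far; send a SELECT to
# the reader only if none of its tables were written in this request
# (a just-written table might not have replicated yet, so reading it
# from a replica would risk stale data).
class QueryRouter
  def initialize
    @written = Set.new
  end

  def route(sql)
    if sql.match?(/\A\s*select\b/i)
      tables = sql.scan(/\b(?:from|join)\s+"?(\w+)"?/i).flatten
      tables.any? { |t| @written.include?(t.downcase) } ? :writer : :reader
    else
      sql.scan(/\b(?:insert\s+into|update|delete\s+from)\s+"?(\w+)"?/i)
         .flatten.each { |t| @written << t.downcase }
      :writer
    end
  end
end
```

One router instance lives for one request, so the write-tracking set resets naturally between requests.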
They were evaluating their reads, and specifically how much data they were sending to their readers by specifying the column names they wanted to select. And they were finding that was actually a pretty big performance issue: sending all these SELECTs with 30 column names, versus a SELECT asterisk, just the text being sent across from your app to the database as the reader. >> And I'm like, these are the types of issues you have at that scale. You don't want to bring everything back on every query, because that doesn't seem performant either, but you're also sending data to the server. >> Yeah. >> So it's an interesting compromise to get less data back. I hadn't really thought about that. Have you encountered anything like that yourself? >> Yes and no. So, I mean, we don't really measure that, and we don't really bother to select just what we need. We are still on the Rails default, if I can say, where we just select everything, and that's it. But we have a few models where we know there are columns with big chunks of data in them. So for those we have specific concerns to avoid selecting those fields by default, because we know they're too big. And we also have tons of constraints inside the app for what is related to the database. When you do a migration, there are a lot of things you are forbidden from doing, or must do, to be able to merge it. For every field we ensure there is a proper size limit, etc. You cannot go bigger than that, because, you know, we know too much. And we have a lot of targets that teams must hit: a table must not exceed this amount of data, that number of columns. We have a lot of constraints on the database.
>> I'm thinking about scenarios where you're introducing a new feature and you might need to modify existing data as part of that rollout, going through all the existing data in your database because you're changing the nature of it. Is the default that you set up new columns and just let the old ones exist for a while, phasing them out at some point? Or do you ever modify existing data as part of a rollout of a new feature? >> So, we don't modify data during the rollout. I mean, the process of updating: people have to do that in several phases, with several rollouts. It really depends on the operation you want to do. We have our own tooling. We have something called safe-pg-migrations; I think it's open source. So we use that. We also have tooling that forbids certain migrations. So we cannot, for example, rename a column. We cannot rename a table. We have to do a multi-step process for those kinds of changes. It's the same if you want to change the format of some data: you will probably keep the old column and have a task running in the background to write the new format; then you enable the feature to use the new one. It's always a multi-step process. The database is a problem at multiple levels. It's a problem for performance; it's a problem because you are always tight on load, and every year we have to spawn new databases and migrate data, always in a rush. But it's also a big problem for rollout, when we put running migrations into production. There are a lot of kinds of migrations we cannot do anymore because of the load on the database. We cannot take locks for that.
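The multi-step format change described (keep the old column, backfill the new one in the background, then flip a switch) hinges on a dual-write window while both columns coexist. This is a deliberately toy, in-memory illustration of that window, not the team's real tooling; the column names are invented.

```ruby
# During the transition, writes go to both the old and the new column,
# and reads prefer the new column, falling back to the old until the
# background backfill has finished. Once reads all hit the new column,
# the old one can be dropped in a final migration.
class DualWriteRow
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end

  def write_title(value)
    attributes[:legacy_title] = value  # old format, still read by old code paths
    attributes[:title]        = value  # new format, what we migrate toward
  end

  def read_title
    attributes[:title] || attributes[:legacy_title]
  end
end
```

The same shape applies to renames: add the new column, dual-write, backfill, switch reads, drop the old column, each step its own deploy.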
There are a lot of operations that are forbidden, so we need to do them in multiple steps, or there's stuff we cannot use anymore. I don't have one in mind as an example, but yeah. >> Those are the types of issues that teams at your scale have. Those are not necessarily things that a small Rails shop with a new application has to even think about, but at some point that might need to change as the app grows and gets more complicated, and it takes a lot longer to add a new table, or rename a table, or you just never get to do that ever again. >> I think for this kind of stuff, if you have users, you should not do it at all, because it's not zero-downtime. If you have even a few users, renaming a table or that kind of thing will bring your website down for a few seconds. Even at a small scale, there are some actions you should not do. >> I can appreciate that. One of the other things I want to talk with you about is that you've been working at Doctolib for a while now and you've led several Rails upgrades there, right? >> Yes. >> I think you said you started using Rails around Rails 5. Was that approximately the version Doctolib was running on when you started working there? >> Yes. Basically, when I joined, there were some freelancers who were upgrading from Rails 4 to Rails 5, the day I joined. So they were doing that, and then I took over for the remainder. >> Has that historically been a one-person responsibility, primarily you, or is it more of a team effort? >> It was personal at the beginning, and since about a year and a half, two years ago, I have a dedicated team. So we do it as a team. But yes, at the beginning there was no ownership over this kind of stuff. We had only feature teams specialized on the product, and nothing related to the platform and non-applicative code.
So it was more of a, you know, boy scout rule. That was my thing. >> What's something you learned the hard way about upgrading Rails apps? >> I think you cannot just change something until it makes your tests pass and call it done. I have been bitten by that too much. You know, like: I have this test, it doesn't pass, I change that, the test is green, I don't really know why, but now it's green and I'm happy and I just go with it. And that's what will fail in production, for whatever reason, because I haven't deeply understood the change behind it; I just either adapted the test a bit or changed one line of the config without properly understanding the big picture. That's the thing. So for every upgrade we do, for every change in the codebase, we add an explanation: okay, this is because they changed that in the framework internals. And we make sure we have understood the change upstream, and that's the way; it takes time. >> Are you able to keep relatively up to date these days? Are you within a major version release, or are you up to date right now? >> We are up to date. We don't spend much time on it, to be honest. Over the years we have improved, so it's easier. We have removed patches; we have tried to upstream stuff. We have also been sticking to the defaults on some configuration, etc. It makes everything easier. So overall now we have a well-oiled process, and it doesn't take too much time. And I think Rails has also improved a lot, shipping fewer breaking changes. Or sometimes they don't even know it's a breaking change, okay? Sometimes it's your fault, sometimes it's a bit of a gray area; it's easy in Rails to plug yourself into things. We have taken care over the years to remove all those gray areas and patches and places where we were using private APIs. So it's now, I would say, pretty trivial.
>> Have you needed to remove certain types of dependencies, certain gems that might have been used in the past, that would have prevented you from upgrading because they were not going to work with the next version of Rails? Is there a philosophy about how you approach what you bring in? >> So I listened to the podcast episode with Jean, where he was talking about these gems that declare, like, "I don't want Active Record 9," for example, and yeah, we removed those. I think, for example, about this gem Bullet, I think, for N+1 detection. It was really complicated with this gem, because it takes a while before they roll out support for the next Rails version. So I'm not blaming them, but yes, for this specific gem it was complicated. They had a hard-coded version check on Rails and hooks inside Active Record. So we removed this kind of stuff. Overall, I think that's really the hard limit. >> We talk with a lot of teams at my company, because we get brought in to help teams with their upgrades: they have a bunch of people working on features, and nobody on their team has a lot of experience finishing an upgrade. There are plenty of people who have started an upgrade project. They created a branch, they worked on it for a couple of weeks, they got stuck, had to switch back to a feature, and then six months later they're like, where was I on this upgrade? And they can't figure out how to build momentum. So they'll call a company like us to come in and help them with that. But I'm always like: ideally, you as a team need to figure this out and make it a regular part of your process. At Doctolib's scale, you're able to have a team that's just thinking about the developer experience. Do you have any advice for smaller teams on how they can start to mitigate that themselves? >> Like a lot of things, to get better at it you have to do it and put your hands on it.
It's not that hard. You have to try to understand the changes, and for that your best friend is often bundle open. You open the gem, you open the Rails codebase, and you try. We have the chance to work with a language that is easy, you know; it's so easy to read, easy to understand. So most of the Rails codebase, most of the gems, are very easy to grasp, and I think we should take advantage of that to try to understand what has changed. Okay, the diff; then you have the pull request. Once you have the pull request, you have the context and you have why it's done, why they changed that, and then you can easily make the change needed inside your codebase. >> It's so easy: just use bundle open and look at the codebase and you'll figure it out. You can read the Rails source code. It's not that scary. >> Yes. I haven't contributed to Rails that much; I think one hand is enough to count it, maybe less. But I have the repo open all the time on my computer, and I spend days in it. It's amazing. We are using React for the front end, and I have tried to do the same with React. >> No comment. >> It's a nightmare, I mean. >> It's also my understanding that Doctolib provides a CLI tool that helps your new engineers get set up. Could you tell us a little bit about that? >> It's what we call dctl. It's a CLI made in Go, I think, something like that. And we use it as an entry point for many things. We have a lot of commands in it. There is a team dedicated to it, and we also have community-based plugins we can add to it. You get a new computer, you just tap the repository, and then you are able to install dctl.
And once you have done that, you can just run dctl dev and it will set up your laptop end to end, Ruby, Node; then you can launch the application, it will install all the tools. Then there are a lot of commands in it. If you want to connect to staging, for example, we can connect to the staging database to try some performance stuff, to have a bit of data. Everything goes through it. It's a great tool for the onboarding: you just run dctl dev and you're up and running. >> That's interesting, so you can allow that to make it easy. Are you using Docker locally as well? >> Yes, we have Docker, but only for data stores. >> Okay. >> So we have Redis, Elasticsearch, and Postgres in it. >> So you've got your databases there; Rails is just running on your Mac machines or whatever. And then if you want to connect to a staging database environment, it will just automatically connect that for you. >> Yeah. I just do dctl staging connect and I'm connected to the staging database. >> I can see how that could be really helpful. What about things like seeding your local database with enough data, given how large a platform you have? >> That's a big topic we haven't talked about. I think that's one of the best things about fixtures: once you have fixtures, and if you use fixtures for your tests, you basically have your seed environment. We don't quite do that. We have a bunch of fixtures that we use, but we don't use them in tests; we use them to seed our environment, and that's basically how we do it. I know that some teams also have some Rake tasks for specific cases; they seed when they want, and they don't add them to the fixture set for everyone because it would be too slow. But yeah, we don't have anything magic around that. We have some fixtures in YAML that we load in the seeds.rb, and that's it.
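Reusing fixture YAML as seed data, as described, amounts to parsing the fixture file and inserting one row per labeled entry. A minimal pure-Ruby sketch with an inline fixture; the attribute names are invented, and in a real Rails app you would create records from these hashes inside db/seeds.rb.

```ruby
require "yaml"

# Each top-level fixture label becomes one seeded row: the same YAML
# that could drive test fixtures doubles as the development seed data.
fixture_yaml = <<~YAML
  alice:
    name: Alice
    role: admin
  bob:
    name: Bob
    role: member
YAML

seed_rows = YAML.safe_load(fixture_yaml).map do |label, attrs|
  attrs.merge("label" => label)
end
```

The appeal is having a single source of truth: whether or not the fixtures also run in tests, the labeled YAML gives every developer the same reproducible starting dataset.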
>> Do you think this is a thing where it would be nice if the Rails framework itself provided some more functionality, or do you feel like this only becomes an issue at a certain organization size, a certain size of engineering team? Because I'm primarily talking on the podcast with a lot of people who are working at really large companies, and I'm just like, wow, the seed situation we might have on a brand new MVP is very different from 12 years into an application. So how do we connect the dots so we're not all having to figure out these little creative things? Every organization is trying to solve this problem themselves. >> To be honest, I haven't looked a lot at this topic. I have seen some people try to push their own DSLs and such around it. I don't know if Rails should push for something. Maybe they should double down on fixtures and make loading the fixtures the default seed, I don't know, something like that. I would be pretty happy about it, because I think on every toy project I do, that's the first thing I set up. I'm not sure we need more tooling around it. Often big corps, etc., will want to have real data from staging; it's a bit too much for the framework to go in that direction, I think. If you're at the stage where you need that, you probably have the manpower to do it, and anyway it would be too specific. >> It's interesting, because I work in the consulting space, and so we come in to clients, and a lot of smaller teams don't have really efficient ways to do it. Plenty of clients will see that their team needs to pull a production database snapshot into a staging-like environment.
They might be scrubbing the data and then using that so they can test something, because they're trying to fix an issue that might only show up in production, and they can't do that without some more realistic data, and we're not going to give their developers direct access to connect to the production database. So they're trying to find all these interesting workarounds, but they don't have that big a team to figure it all out. And so they're like, well, what do we do in the meantime? And there are not a lot of good patterns, I think, that are easy to find or commonly shared, because everybody's like, I don't really know how best to do this. And they ask us for advice, and I'm like, well, you can do what these big companies are doing, where they have someone specially focused on that problem, but you can't afford that. So I don't know what to say. >> Yeah. >> Be more successful as a company? I don't know. >> No, I don't know. But for this kind of stuff, we connect to staging, which is refreshed from production and where we scrub the data. That's basically how we do it. >> I want to circle back to upgrades in particular. What's the strategy there? Are you doing anything like dual boot? Do you have a branch that's running against the latest code on Rails main? >> Yeah. So when there's a new version shipping, I mean a major one, it depends on the time frame we have, etc. But if we have the time, we start early with the alpha; often there is an alpha. So we start by opening the pull request, running the update command, rails app:update or something like that, to update the configuration, etc. And the first thing we try to get is the CI running. Often the CI will crash, so we fix those crashes, and once the CI is running, we know the number of tests that will be failing. Then we do the back and forth, the CI-driven development, where we try to get the CI green. So we will fix everything ourselves.
Every test, every piece of code, every change needed. We won't ask teams to do it; we do it ourselves. Sometimes we consult them to better understand a feature, but we basically do everything ourselves. Once we have something close to green, we backport changes. I mean by that, everything that is needed for the next version but can already be merged to main, we merge. At the end of the day we want the smallest change possible, so we open pull requests to offload that bit by bit. I didn't mention it, but of course, as I said just before, inside the pull request we have a link to the Rails change for every change we have done. And once the CI is green and we have the smallest change possible, we merge that through what we call the dual boot. So once it's merged, everything runs on the next version of Rails: locally, all the CI, staging, preprod environments, this kind of stuff runs with the next version, but not production and the production CI. Okay? So everything is on the next version except production and the production CI. Then we run like that for an amount of time. It depends on the confidence we have in the change, and it also depends on the agenda of the organization; as it's a big change, we often have to pick when we can do it. And once we have the green light, we remove the dual boot, and that's basically the process. >> When you talk about backporting, do you organize those commits in a way that you can just cherry-pick those particular commits to bring back into main? How granular do you try to keep those changes? Because you have this branch with a bunch of work where you got the test suite passing again. >> Yeah, we group them by Rails changes, you know, like, okay, this configuration has changed, so we need to change this code, this code, this code. We group them together: okay, that's the change in Rails that has led to this diff.
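The dual-boot flow he describes is usually wired up with one Gemfile resolved into two lockfiles behind an environment switch; a hedged sketch of that pattern (version numbers are illustrative, the DEPENDENCIES_NEXT variable is the convention used by community gems like Shopify's bootboot that automate this, and the episode does not specify Doctolib's exact mechanism):

```ruby
# Gemfile: one file, two resolutions. Bundling with DEPENDENCIES_NEXT=1
# against a Gemfile.next symlink produces a second lockfile, so CI and
# staging can boot the upcoming Rails while production keeps the current one.

def next_rails?
  ENV["DEPENDENCIES_NEXT"] == "1"
end

if next_rails?
  gem "rails", ">= 8.1.0.alpha" # upcoming version under test (illustrative)
else
  gem "rails", "~> 8.0"         # version production runs (illustrative)
end
```

The payoff of this setup is exactly what he describes: flipping which Rails boots is a one-variable change per environment, so the final production switch is tiny and easy to roll back.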
That's basically how we group them. >> Do you feel like you've gotten to a good pattern as you're navigating one of those upgrade branches? You've got your branch, you're working through getting your test suite running in the first place, and then working through the broken tests. Do you have good enough pattern recognition to spot, oh, these things seem to be clustered, so I'm going to focus on that area for a while? Because I know that when you do that, sometimes it's just like, oh, there's a bunch of things going on, and you're like, how do I prioritize where I go? There are different rabbit holes you could potentially go down, which I think is what makes people a little nervous about doing it, because it feels like you can go in any direction. There's a bunch of fires popping up in parallel. How do I approach this? >> Yeah. In our case, our CI reports the most common patterns of failures across our tests. So I run the whole suite, and it will tell me, like, 10,000 tests are failing because of this one thing. I often focus on that one first, the big one. You're pretty happy, because you see the number of green tests grow a lot. And yes, the remaining hundred are the slowest to fix, because each is always a different case, but that's basically how we do it. >> That makes sense. While we're thinking about tests, could you tell us a little bit about how your team addresses and avoids flaky tests? Is it safe to assume that you still have flaky tests appear at times? >> Yes, it's a massive topic. We have something around almost 20,000 end-to-end tests. That's a number that doesn't grow a lot anymore; we completely changed the strategy. We have not stopped, but we try to avoid having too many end-to-end tests because of the flakiness, and the cost also, but a big part of it is the flakiness. We have invested a lot in solving the flakiness problem over the years. I think it's a seven-year project, and so we have a lot of things in place.
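The "CI reports the most common failure patterns" step can be approximated by normalizing error messages into signatures and counting clusters; a small self-contained sketch (the sample messages and normalization rules are invented for illustration, not Doctolib's actual pipeline):

```ruby
# Cluster CI failures by error "signature" so the biggest group is fixed first.

# Normalize a failure message by stripping volatile details: numbers,
# backtick-quoted identifiers, and the receiving class name.
def failure_signature(message)
  message.gsub(/\d+/, "N")
         .gsub(/`[^`]*`/, "`...`")
         .gsub(/for [A-Z]\w*/, "for ...")
end

# Count failures per signature, largest cluster first.
def cluster_failures(messages)
  messages.group_by { |m| failure_signature(m) }
          .map { |signature, group| [signature, group.size] }
          .sort_by { |_signature, count| -count }
end

failures = [
  "NoMethodError: undefined method `update_attributes` for User",
  "NoMethodError: undefined method `update_attributes` for Order",
  "ArgumentError: wrong number of arguments (given 2, expected 1)",
]

cluster_failures(failures)
# => [["NoMethodError: undefined method `...` for ...", 2],
#     ["ArgumentError: wrong number of arguments (given N, expected N)", 1]]
```

Fixing the top signature first collapses the largest group of red tests, which matches the "big one first" approach he describes.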
I don't know if you have this in the US, but in Europe we have the Nutri-Score. Basically, when you buy food, there is a score, you know, A, B, C, D, telling you whether it's good for your health or not. >> Oh yeah, yeah, yeah. >> So we have basically the same for tests. We take metrics coming from the main branch and compute a score for each test: this test is likely to fail this percentage of the time. So it has a Nutri-Score, the team sees it, and then we push bugs to them and they have to fix them. We also have a lot of strategies around retrying: if a test fails two times in a row on the CI, we automatically skip it and create a bug ticket for the owning team, so they have to fix it. So yeah, we have a bunch of processes. And linking that with a new framework we have introduced: capybara-lockstep. A German company open sourced it. Basically, most of our flakiness comes from the React front end: Capybara's state is not in sync with the front end state, so you try to click on a button but the JavaScript is not ready, and it fails. Or the dropdown is opening but not finished yet, the button is moving, and the click doesn't land. It's often this race condition with the front end; it's like 70 to 75% of our flakiness. So capybara-lockstep tries to address that by synchronizing Capybara with the state of the front end. Capybara basically holds a mutex and does nothing until the page is loaded, until AJAX queries have finished, until the network is quiet, and so on. That has fixed a lot of the problems we had, I think, but there is still a lot of flakiness coming from the front end, and that's why we have all these processes. I think the best strategy we had is basically to make flaky tests less noisy and force people to work on them by skipping them. >> I just pulled that up, so I see it: capybara-lockstep. I'll include links to this in the show notes as well. It's such an interesting thing. >> Yeah, it's really
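The retry-then-skip policy (two consecutive CI failures mean auto-skip plus a bug ticket) can be sketched in plain Ruby; run_with_retry and the on_skip callback are hypothetical names, since the real version hooks into their CI and bug tracker:

```ruby
# Sketch of the "fails twice in a row => auto-skip + bug ticket" policy.

Result = Struct.new(:status, :attempts)

# Run the test block up to max_attempts times. If every attempt fails,
# skip the test (and file a ticket) instead of failing the build.
def run_with_retry(max_attempts: 2, on_skip: nil)
  attempts = 0
  begin
    attempts += 1
    yield
    Result.new(:passed, attempts)
  rescue StandardError
    retry if attempts < max_attempts
    on_skip&.call # e.g. create the bug ticket for the owning team
    Result.new(:skipped, attempts)
  end
end

# A flaky test that fails once then passes is retried and reported green:
calls = 0
run_with_retry { calls += 1; raise "flaky" if calls == 1 }.status # => :passed

# A consistently failing test is skipped and a ticket is filed:
tickets = []
run_with_retry(on_skip: -> { tickets << :bug }) { raise "broken" }.status # => :skipped
```

The key trade-off, as he notes, is that skipping keeps the pipeline quiet while the ticket keeps the pressure on the owning team to actually fix the test.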
nice. >> Why can't Capybara just do this from the get-go? >> Why is JavaScript so complicated? Um, I think it will improve. To do this today, capybara-lockstep needs JavaScript snippets you have to include, and so on. But Selenium is migrating to BiDi, which is a new protocol for communicating with the browser. It talks directly to the browser instead of going through ChromeDriver, and thanks to this new protocol, Selenium will have more information about the browser's state and can improve. I hope we will at least be able to build a more reliable tool to synchronize the state. >> You mentioned your team uses React. Is that code within the Rails app, using something like React on Rails, or is it a separate repository where all the front end lives? Is this opening up a can of worms? >> It's a good question. When I joined, I had never used React, but there is this Ruby helper where you can render a React component, and it will mount the component into your Slim page, or I mean your ERB page. So we have this pattern in the codebase, and that's one of the ways we did it in the past for some parts of the application, like admin pages. We did that on the patient website too, but now we don't do that anymore; we have a proper SPA. It's still the monolith that renders the first HTML layout, but then it directly mounts the SPA into it. >> I see. And so in a local environment, people are spinning up the React SPA and have the Rails app running as well? >> Yep.
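The helper pattern he remembers matches what the react-rails gem provides; a minimal sketch, assuming react-rails is installed (the component name, props, and view path are made up):

```erb
<%# app/views/admin/dashboard.html.erb %>
<%# The server renders this ERB page; react-rails then mounts the named
    JavaScript component into a placeholder div on page load. %>
<%= react_component("AdminDashboard", { clinicId: @clinic.id }) %>
```

This is the "mount a component into your ERB page" approach he describes for admin pages, as opposed to their current setup where the monolith renders only the first HTML layout and a full SPA takes over from there.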
So we have a webpack-dev-server running alongside the application, which is not ideal. Today the React dev server, the webpack-dev-server, sorry, consumes a lot of RAM and puts our machines under heavy pressure. It's one of the biggest drivers of bad dev experience generally. >> Are there any non-Rails patterns or ideas that have been especially successful in your codebase? >> To be honest, I don't know. We mostly stick to what Rails offers. Maybe in three or four years I'll be able to tell you, because with all this modularization work and new services going on, we are starting to introduce Kafka, and so we will probably see some patterns you don't find in a usual Rails app. So no, nothing comes to mind. I mean, the engine stuff gave us a good runway. It's not perfect, but it gave us a good runway. >> Is there anything that your team does differently than you think most Rails teams do? >> I don't think so, no. I mean, at our scale we do different things, we have ten databases, this kind of stuff, but at our scale I think other people would do the same as us. I did my best to try to stick to the defaults, you know, and if I can say so, we have been pretty lucky in that direction, because Shopify has paved the way for many things and put them into the framework, so we have benefited a lot from that. They gave us the direction, and they also gave us the tools. When we needed multi-database support, Eileen had just merged it; she was at GitHub at that time, but she merged it into Rails. So we just had to upgrade, and we had multi-database. There are a lot of things we do where we just get it from either Shopify or Rails, because they have done it and merged it upstream. >> It seems to me that Rails has definitely been part of Doctolib's success, maybe early on. I don't actually know a lot about how the organization started, but out of curiosity, was one of the founders a software developer themselves? >> There were three founders, and out of those three, two were technical. >> And they started that first Rails app?
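The ten-database setup rests on the multi-database API he credits to that upstream work; a sketch of the built-in Rails mechanism (the role and database names here are illustrative, not Doctolib's actual topology):

```ruby
# config/database.yml (excerpt, illustrative):
#   production:
#     primary:
#       <<: *default
#     primary_replica:
#       <<: *default
#       replica: true

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Route writes to the primary and reads to its replica role.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Explicitly run a block of queries against the replica role:
ActiveRecord::Base.connected_to(role: :reading) do
  # read-only queries here, e.g. Patient.count
end
```

This is how read traffic gets routed across Postgres replicas with framework support alone, without a custom connection-switching layer.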
>> Yeah, exactly. >> It seems to be a common theme in a lot of the conversations I've been having with different companies. So do you think that Rails is still one of Doctolib's, I'm air quoting, secret weapons? >> I think so, yes. I'm pretty sure some people at the company would say no, but I'm one of the people who says yes. Onboarding is a good example. I think every Rails app I know looks the same. There are tiny differences, you know, like they use service objects, we don't; they have query objects, we don't. But most of them look the same, and it's trivial to navigate: okay, I have an account model, I have an account test, I have an accounts controller, I have a controller test. So it allows us to onboard people really fast, because if they have a bit of experience with Rails, it's really easy for them to catch up, and if they don't, the structure is so self-explanatory. And Ruby helps a lot here too: at the code level it's so easy to read. Even in a big codebase, I think that simplicity is a big advantage. >> I can appreciate that. You know, I think about when teams are growing and people come in at different points, and maybe they have different experiences. At a certain scale, people might come to an organization having worked with different tech stacks, and they have different experiences to bring, and they're like, wow, we used to do this differently at this other company I worked at, at this large a scale. Rails seems different in certain ways. Like, I wish I had some of the things that I could lean on. But then there are a lot of different people in different roles as well.
I would imagine, I feel like sometimes I talk with people at large organizations and they're like, well, not everybody loves Rails as much as I do. And, you know, it just becomes this interesting thing where you get different people with different ideas, and not everybody there is living and breathing Ruby on Rails necessarily, because you've got your React developers, I'm imagining, and different people focusing on different things. There's this interesting tension of competing forces over what's slowing different people down, based on where they're spending the majority of their time. So it's good to hear that Doctolib is able to keep up on Rails at that large an organization. It's been able to scale with you, and you're able to look at Shopify and see how Shopify is paving the way to make that possible, so you can keep working with Ruby on Rails there and just follow the breadcrumbs they're leaving as they go: follow us this way, it's working, this is going to work out. Those are things we didn't have ten years ago, when we used to have these conversations like, well, Rails might not work in the sort of larger situation that companies get to. And you have other issues where you're hitting limits with, like, AWS database sizes and stuff like that, and that's not necessarily Rails's fault; that's not a Rails issue. >> Yeah, as with most web scalability things, it's always the database. >> A couple of last quick questions for you. Is there a technical book that you find yourself recommending to teammates or peers over and over again? >> To be honest, I don't read that many books, but I have this book, The Software Engineer's Guidebook by Gergely Orosz, that I still have to read. I've heard good things about it. So if I have to recommend one, it would be that one. It's not technical per se.
It's more about how you lead as a tech lead, how you lead things, how you lead change. And I think when you work at a big corp, that's one of the biggest challenges you have. It's always about humans, how you bring change to the table, and how you get things moving forward instead of just changing code. Pushing code and changing stuff is the easy part. >> I think refactoring your code is one thing; refactoring how your team communicates and makes decisions is a whole other big challenge, and I don't know that Rails solves that by itself. But I'll definitely include links to that book in the show notes as well. I'm curious, where can listeners best follow your thoughts or ruminations about software engineering or Rails? Does Doctolib have an engineering blog? Do you publicly talk much about this stuff? You mentioned having gems and stuff that you've released as an organization as well, right? Where can I direct people? >> Good question. I think we had an engineering blog at some point, but nobody's taking care of it anymore. We had a Twitter also, but I don't think anyone is taking care of that anymore either. So, to be honest, I don't have much. As for the open source work, it's on our GitHub organization. >> I'll track down these links for everybody. And maybe when this episode comes out, you'll have an excuse to write a blog post, like, hey, check out that episode of On Rails, on the engineering blog. There you go, I've given you a free blog post. Get that rebooted. Thanks so much, Florent, for stopping by to talk shop with us on On Rails today. >> Thanks a lot. >> That's it for this episode of On Rails. This podcast is produced by the Rails Foundation with support from its core and contributing members. If you enjoyed the ride, leave a quick review on Apple Podcasts, Spotify, or YouTube. It helps more folks find the show. Again, I'm Robby Russell. Thanks for riding along. See you next time.

Video description

In this episode of On Rails, Robby is joined by Florent Beaurain, a longtime Rails engineer at Doctolib (@Doctolibfrance), home to one of the largest Rails monoliths in Europe with over 3 million lines of code and 400+ engineers. They explore how Doctolib’s team tackled massive test suite performance issues, including cutting one engine’s test time from seven minutes to under one minute. Florent shares insights from managing 84,000 tests, scaling across 10 PostgreSQL databases, and maintaining Rails upgrades across a fast-moving organization using systematic approaches like dual-boot deployments and careful backporting strategies. *[00:04:56]* – Doctolib’s CI suite runs 84,000+ tests per commit, using over 130 CPU hours *[00:07:38]* – Scaling test infrastructure with 45-minute pipelines, test selection, and parallel servers *[00:10:31]* – Improving local dev experience by letting engineers run isolated engine tests *[00:12:47]* – Database resets and factories identified as key bottlenecks for test performance *[00:14:23]* – Switching to transactional tests and revisiting fixtures to align with Rails defaults *[00:18:44]* – Dropping one engine’s test time from 7 minutes to under 1 minute *[00:25:50]* – Migrating 90% of the codebase to a faster testing framework in three months *[00:31:03]* – Using Packwerk to modularize the monolith—why zero-dependency engines are a myth *[00:36:14]* – Leveraging AI to automate cleanup tasks and support onboarding *[00:43:20]* – Hitting AWS Aurora scaling limits with 10 Postgres writers and 15 readers each *[00:50:15]* – Avoiding downtime with multi-step database migrations and rollback strategies *[00:52:11]* – Staying current with Rails via dual-booting, CI-driven development, and upstream tracking *[00:56:37]* – Advice for smaller teams upgrading Rails: read the source code and start small *[01:06:23]* – Managing 20,000 end-to-end tests with retry logic and Capybara Lockstep *[01:10:35]* – Using internal CLI tool (dctl) to 
streamline local setup and staging access Socials: Twitter/X: https://x.com/_beauraF LinkedIn: https://www.linkedin.com/in/beauraf/ GitHub: https://github.com/beauraF Company: Homepage: https://www.doctolib.fr Tools & Libraries Mentioned: AWS Aurora (PostgreSQL) – Their production database platform, scaled to 10+ writers and 15+ readers. (https://aws.amazon.com/rds/aurora/postgresql/) Capybara – Used for end-to-end testing of UI flows in the monolith. (https://github.com/teamcapybara/capybara) Capybara Lockstep – A JavaScript sync layer that helps reduce flakiness in React-driven feature specs. (https://github.com/makandra/capybara-lockstep) Datadog – Application performance monitoring and alerting for production systems. (https://www.datadoghq.com/) Docker – Used to run local PostgreSQL and other data stores for development environments. (https://www.docker.com/) FactoryBot – Used for generating test data; identified as a major performance bottleneck in large test suites. (https://github.com/thoughtbot/factory_bot) factory_fixtures – Shopify gem that extends fixtures with inline factory-style overrides. (https://github.com/Shopify/factory_fixtures) GitHub Copilot – Used experimentally to help with workflow automation and onboarding support. (https://github.com/features/copilot) Heroku CI – Previously used for parallelized CI builds before moving to custom infrastructure. (https://devcenter.heroku.com/articles/heroku-ci) Jenkins – Their original CI platform before scaling up to more powerful infrastructure. (https://www.jenkins.io/) Minitest – Their primary test framework, used throughout the monolith with some extensions. (https://github.com/minitest/minitest) Packwerk – Used to modularize their monolith into engines with explicit boundaries and dependency declarations. (https://github.com/Shopify/packwerk) PostgreSQL – Core relational database behind their production and local environments. 
(https://www.postgresql.org/) React – Their primary frontend framework, integrated into the Rails monolith via a single-page app architecture. (https://react.dev/) safe-pg-migrations – Tool to reduce downtime risks during large-scale schema changes. (https://github.com/doctolib/safe-pg-migrations) Sentry – Error tracking and visibility tool integrated into their release workflow. (https://sentry.io/) Webpack Dev Server – Used locally to support React development alongside the Rails app. (https://webpack.js.org/configuration/dev-server/) #rails #rubyonrails #tech #DoctoLib On Rails is a podcast focused on real-world technical decision-making, exploring how teams are scaling, architecting, and solving complex challenges with Rails. On Rails is brought to you by The Rails Foundation, and hosted by Robby Russell of Planet Argon, a consultancy that helps teams improve and modernize their existing Ruby on Rails apps.
