Analysis Summary
Ask yourself: “Did I notice what this video wanted from me, and did I decide freely to say yes?”
Worth Noting
Positive elements
- This video provides specific, reproducible benchmark data regarding VRAM overhead in Windows vs. Linux that is highly relevant for users with limited hardware.
Be Aware
Cautionary elements
- The transition from technical advice to career coaching uses the 'authority' of the benchmarks to suggest that the creator's paid community is the only path to professional AI competence.
Influence Dimensions
About this analysis
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Transcript
If you want to become a local AI expert, then this video is the right one for you. You're going to learn why Linux is the right move to optimize your local AI environment. And I ran over 100 benchmarks on this very GPU to prove it. Now, here's the truth. Linux wins, but not for the reason you might expect. When I tested raw inference speed, Linux was only about 2 to 3% faster than Windows. And that's honestly not that impressive. Here's what truly matters. When I loaded models with 60,000 tokens of context, which is what you need for real AI coding, Linux used about 800 megabytes less VRAM on this GPU than Windows. And every single test showed the same thing. Consistently 800 megabytes saved. That means you get more room to load larger models on your GPU or increase your context window. And if you've got a 16 GB GPU, that 800 megabytes is 5% of your total VRAM. This is the difference between your model fitting in memory or not fitting at all. So, let me show you exactly how I got these numbers and why I set up my test environment the way I did. In my local AI coding video, I was trying to run a coding model and I kept running out of VRAM, and that's what sent me down this rabbit hole in the first place, because I wanted to know how much of my GPU the operating system was actually eating up. To test this properly, I needed a clean Linux environment. Now, before the Linux gurus in the comments start recommending their favorite distribution, let me explain why I specifically went with Ubuntu. The thing is, AI engineering is already hard enough on its own. You're collecting data, you're understanding how to fine-tune, and you're debugging inference servers. Honestly, the last thing I want to be doing is spending three whole days fighting with Nvidia drivers. With Ubuntu, the Nvidia drivers just work, CUDA installs cleanly, and when something breaks, I want to be debugging my actual code and not the operating system.
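The question of how much VRAM the operating system eats at idle can be checked directly with `nvidia-smi`, which supports a machine-readable CSV query mode. A minimal sketch, with helper names of my own (the video's actual benchmark script isn't shown here):

```python
import subprocess

def parse_vram_csv(csv_line: str) -> tuple[int, int]:
    """Parse one 'used, total' MiB row from nvidia-smi CSV output."""
    used, total = (int(v.strip()) for v in csv_line.split(","))
    return used, total

def idle_vram_mib() -> tuple[int, int]:
    """Query how much VRAM is already in use before any model loads."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_vram_csv(out.splitlines()[0])
```

Running this on a freshly booted Windows machine and again on Linux, before loading any model, is the quickest way to see the baseline overhead each OS reserves.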
So, of course, Ubuntu is not the most hardcore Linux choice, but it gets out of my way and lets me do the actual work. And if you've never touched Linux before, it's still one of the most beginner-friendly options out there. Now, there's another reason for choosing Ubuntu. Most of the serious AI tools, the stuff that production systems actually use, target Ubuntu specifically. I'll show you what I mean by that later, after I show you the live benchmarking test. So, here's the actual test rig for the benchmarks. I'm running an RTX 5090 with 32 GB of VRAM. To make sure that the comparison was completely fair, I used two identical SSDs: I installed Windows 11 on one of them and Ubuntu 24 on the other. When I want to switch operating systems, I just boot from the other drive. Otherwise, I'm using the exact same hardware to test the large language models. Welcome to my Linux machine, where we are going to be running the benchmark. You can see my GPU is ready to go, and I'm going to run the cross-platform test Bash script right now. Before the actual test gets started, it runs a warm-up script, because you want to make sure that the tests are as comparable as possible. You want to run a little bit of a pre-test to warm up the GPU, so that, for example, any boost clock speed that might be hit while the test is running is at least not going to be inconsistent between the different tests. The GPU is warmed up before every single test. That's the whole idea behind that. Right now we are using the Llama 3.1 model, and this benchmark will run through various tests for both the Llama model as well as the bigger Mistral one. You can find it as part of the Linux AI benchmark repository in the link in the description down below. There's a README that will instruct you how to get it up and running for Linux and Windows, and in our case, we are just using Docker to invoke it.
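The warm-up idea described above can be sketched as a small harness: run a few throwaway generations first, then time only the measured run. This is illustrative only; `generate` stands in for whatever inference call the real benchmark script makes:

```python
import time
from typing import Callable

def warm_up(generate: Callable[[int], None],
            rounds: int = 3, tokens: int = 256) -> None:
    """Run a few throwaway generations so boost clocks settle before timing."""
    for _ in range(rounds):
        generate(tokens)

def timed_run(generate: Callable[[int], None], tokens: int) -> float:
    """Return wall-clock seconds for one measured generation."""
    start = time.perf_counter()
    generate(tokens)
    return time.perf_counter() - start
```

Repeating the warm-up before every test, as the video does, keeps clock-speed variation from polluting the OS-to-OS comparison.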
Docker has a lot of great runtime capabilities to make sure that your GPU is fully utilized, and we'll see that as the test goes on as well. At this point, we're actually running the proper tests. You can see here we are doing some decoding tests now, which basically means that 2,000 tokens are being generated on an empty context window. And the most important thing here is that we're testing performance. So I'm not evaluating how useful the LLM's answers are. It's purely decoding and seeing how fast the GPU really is. Now, of course, when I ran the actual tests, I wasn't recording. You can see that my GPU is just a bit distracted because it has to record the screen here. I ran all of these tests in the background, and you can find the full benchmark results for Linux, Windows, and even Windows Subsystem for Linux in the repository as well. And when you're running this kind of benchmarking, it's important to get the machines as close to each other as possible in terms of the software you're running. So the Nvidia drivers as well as the CUDA versions have been aligned as closely as possible to get consistent results. One good thing to note is that a lot of the code here is in Python, and Python is a slow programming language, but the actual underlying code that's invoking the AI model is of course not Python. It's much lower-level, usually C or C++ code. So you don't need to worry about that. The Python wrapper is just a thin layer on top that keeps track of things like the GPU metrics, so we can store them in our final test report. So now we are starting the context stress tests. And these are super interesting, because a lot of YouTube videos that show local AI just show it running on a completely empty context window, and then everything generates at like 200 tokens a second, and they say, "Wow, look at how far local AI has come."
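The "thin Python wrapper" role described above, tracking metrics and writing them into a final report, can be sketched with a small result record. The field names here are my own, not the repository's actual schema:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class DecodeResult:
    """One benchmark row: what was run, where, and how it performed."""
    model: str
    os_name: str
    tokens_generated: int
    elapsed_s: float
    vram_used_mib: int

    @property
    def tokens_per_second(self) -> float:
        return self.tokens_generated / self.elapsed_s

def to_report_line(r: DecodeResult) -> str:
    """Serialize one row as JSON for the final test report."""
    row = asdict(r) | {"tokens_per_second": round(r.tokens_per_second, 1)}
    return json.dumps(row, sort_keys=True)
```

The heavy lifting stays in the C/C++ inference engine; Python only times the run and records numbers, so its own slowness doesn't skew the measurement.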
But the truth is that most real local AI use cases, like agentic AI coding, will fill up your context window. And as your context window fills up, things will become slower very quickly. We'll actually see that right here. When this test starts, it will generate 512 tokens, but it will do so in an existing context window of 10,000 tokens, moving up to 60,000 tokens in steps of 10,000 tokens per iteration. So indeed, here we start out with just 10,000 tokens in the context window, and you can see that the tokens per second is 117. Still quite good, right? But now, with 20,000 tokens, you can see that we only get 83 tokens a second, and the degradation is already minus 29% compared to this baseline. Things only get worse as we increase the context window. At 60,000 tokens, we end up losing almost 75% of our performance. So there are lots of results here, and I could spend half an hour describing all of them, but what are the results that actually matter and that you should know about right now? I tested two models. The first is Llama 3.1 with 8 billion parameters, quantized to Q4. This is your entry-level model, something that a lot of people with mid-range GPUs can actually run. The second one is Mistral Small 24B, also quantized, and this one is closer to what you would actually want to use for a more serious use case like agentic coding, where the model needs to handle more complex instructions and tool calls. So, let's have a look at the numbers. Across both models, the Llama model and the Mistral model, Linux consistently saved around 800 MB of VRAM. That consistency is what I find really interesting here. Whether I tested the smaller model or the bigger one, the savings were basically the same. And this tells me it's not really about how the models get loaded or how the inference engine works. It's pure operating system overhead. Windows just reserves more GPU memory for its own processes than Linux does. Now, why does this matter so much?
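The degradation figures quoted above follow from a simple percent-change calculation against the 10,000-token baseline. A quick sketch (the function name is mine):

```python
def degradation_pct(baseline_tps: float, tps: float) -> float:
    """Percent change in tokens/sec relative to the smallest-context baseline."""
    return (tps - baseline_tps) / baseline_tps * 100.0
```

With the numbers from the test, `degradation_pct(117, 83)` gives about -29%, matching the on-screen readout; a 60,000-token run at roughly a quarter of the baseline throughput would land near the -75% the video reports.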
Well, because when you run out of VRAM, your model spills into system RAM and your whole machine grinds to a halt. Your AI model will become super slow. I showed that in my local AI coding video. So, on a 16 GB card, for example, that 800 MB of savings is 5% of your total VRAM. That could mean you can now fit a model that was just barely too large on Windows, or squeeze in a few thousand more tokens of context, which is great for your AI coding use cases. Now, I know that some of you clicked on this video expecting me to tell you that Linux is way faster for inference as well. Well, across all my tests, the average speed difference for these models was only about 2 to 3%. Some individual tests even came in under 1%. So, in my view, a few percent of speed difference is honestly not reason enough to switch operating systems on its own, because you're not really going to notice that in your day-to-day workflow unless you're running huge batches of work. But the 800 megabytes of extra VRAM? That actively changes what you can achieve with your existing hardware. And beyond these benchmarks, there's another big benefit to Linux that you're really going to feel over time. On Windows, doing AI development means installing a ton of stuff at the system level. You've got the CUDA toolkit, Visual Studio build tools, multiple Python versions, various compiler toolchains, and over time, all of this stuff accumulates. Different projects need different versions, and eventually things might even start conflicting with each other. And when you try to uninstall something, good luck getting rid of everything cleanly. There are always leftover registry entries and environment variables pointing to stuff that doesn't exist anymore. On my Ubuntu machine, almost everything lives in Docker containers. If I need a specific CUDA version for a project, that goes in the Dockerfile. If I need some weird build toolchain, that gets containerized, too. My actual system stays minimal and clean.
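The 5% figure is straightforward arithmetic: a fixed saving in MiB divided by the card's total VRAM. Sketched out (function name mine):

```python
def savings_share_pct(saved_mib: int, total_gib: int) -> float:
    """What share of a card's total VRAM a fixed MiB saving represents."""
    return saved_mib / (total_gib * 1024) * 100.0
```

800 MiB on a 16 GB card is about 4.9%, which rounds to the 5% quoted; on a 32 GB card the same saving is only about 2.4%, so the effect matters most on smaller GPUs.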
The only thing it costs is a bit more storage. And when I'm done with a project, I can delete the container and everything is properly removed. Docker also runs natively on Linux without any virtualization layer involved, which brings me to an important point. Why am I not using Windows Subsystem for Linux 2? Some of you are probably thinking this: you might be convinced that Linux has benefits, but why can you not just run Linux inside of Windows? If you run Linux inside of Windows with WSL 2, don't you have the best of both worlds? Well, I tested that, too, and I want to save you the trouble. Using WSL 2, you will actually consume more VRAM than on Windows alone. About a full GB more than native Linux, in fact. So, you're not really getting the Linux benefits. You're getting the worst of both worlds. Windows Subsystem for Linux is fine for general development work, because you're not going to notice this drawback when you're just creating a web app. But if you're trying to maximize your GPU for local AI, I don't think there's really a shortcut. You need native Linux. And here's something that might push you over the edge if you're still on the fence. The serious AI tools all target Linux. vLLM, the industry standard for serving LLMs in production, is Linux-only. TensorRT, Nvidia's optimized inference engine: Linux-only. And the Lambda Stack lets you set up your entire deep learning environment with one command and specifically targets Ubuntu. This allows you to not only run AI, but fine-tune and train models yourself. The companies hiring AI engineers often love people with Linux experience, because the infrastructure you work with professionally often relies on Linux. So learning Ubuntu isn't just about getting some VRAM back. You are building real skills that transfer directly to professional AI engineering. So, if this managed to convince you, here's the path forward.
You can just grab a separate SSD and install Ubuntu on that. Keep Windows on your existing drive. When you boot your computer, you can pick which drive to start from. Windows stays completely untouched, and there's no risk of the two operating systems messing with each other if they are on separate drives. That's the setup I've been using for these benchmarks. Nowadays, the Ubuntu installer takes like 10 minutes, and things like Nvidia drivers are even installed by default. There's still a bit of a learning curve, but it's way smaller than it used to be. Now, don't just take my word for this. The benchmark tool that I created is on GitHub, and the link is in the description down below. You can run it on Windows, install Ubuntu, and then run it again to see the difference for yourself. And that's how you become a local AI expert.
Video description
🎁 Get the FREE Linux AI Kit from the video: https://zenvanriel.com/open-source
⚡ Master AI and become a high-paid AI Engineer: https://aiengineer.community/join

I ran over 100 benchmarks to find out if Linux actually beats Windows for local AI - and the results surprised me. Linux wins, but not for the reason most people think. The real game-changer isn't speed (only 2-3% faster). It's the 800MB of VRAM you save on every single model load. On a 16GB GPU, that's 5% of your total memory - the difference between your model fitting or not fitting at all.

What You'll Learn:
- The actual benchmark results: Linux vs Windows vs WSL2 on identical hardware
- Why VRAM savings matter more than raw inference speed
- How context window size destroys your performance (75% degradation at 60K tokens)
- Why WSL2 gives you the worst of both worlds (uses MORE VRAM than Windows alone)
- The Docker advantage: keeping your Linux system clean while Windows accumulates bloat
- Why professional AI tools (VLLM, TensorRT, Lambda Stack) target Ubuntu specifically
- How to set up a dual-boot system with zero risk to your Windows installation

Timestamps:
0:00 Linux wins local AI, but not why you think
0:30 Initial Test Results
1:27 Why Ubuntu specifically (not your favorite distro)
2:49 Live benchmark demo on Linux
5:41 Running LLM Tests
6:32 Important Test Results
8:28 Another Linux Benefit
9:31 Why not WSL2?
10:18 Other Linux Tools
11:06 Start with Linux

Why did I create this video? I kept running out of VRAM while trying to run coding models locally. That frustration sent me down a rabbit hole to find out exactly how much GPU memory Windows was eating up. After running over 100 benchmarks, I wanted to share the real data - not opinions, not vibes, actual numbers you can verify yourself with the benchmark tool I've open-sourced.

Connect with me:
https://www.linkedin.com/in/zen-van-riel
https://www.skool.com/ai-engineer

Sponsorships & Business Inquiries: business@aiengineer.community