bouncer

Dave's Garage · 263.2K views · 14.7K likes

Analysis Summary

20% Minimal Influence
(Scale: mild · moderate · severe)

“Be aware that the performance 'sweet spots' suggested are based on specific hardware loans and personal projects, which may bias the recommendations toward certain brands like Dell or NVIDIA.”

Transparency: Transparent
Human Detected: 98%

Signals

The content is clearly human-authored, featuring a well-known creator (Dave Plummer) who provides specific personal project history, natural conversational filler, and expert-level hardware insights that deviate from formulaic AI scripts.

Personal Anecdotes and Context: References specific past projects like the 'Tempest Arcade AI project' and personal security camera setups.
Natural Speech Patterns: Uses colloquialisms ('gobs of RAM', 'speed costs money, kid') and natural self-corrections/asides ('watch the channel for that one coming soon').
Technical Nuance: Explains the distinction between capacity and performance in a way that reflects hands-on troubleshooting rather than a generic summary.

Worth Noting

Positive elements

  • This video provides highly specific, real-world token-per-second benchmarks across vastly different RAM tiers, which is rare to see in a single controlled comparison.

Be Aware

Cautionary elements

  • The casual mention of hardware 'loans' from manufacturers like Dell can create an unconscious bias toward recommending high-end enterprise gear for consumer tasks.

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed: March 13, 2026 at 16:07 UTC · Model: google/gemini-3-flash-preview-20251217
Transcript

Hey, I'm Dave. Welcome to my shop. Today, we're going to look at running a large range of AI models entirely at home, right on your own desktop. But once you leave the cloud behind, you become entirely dependent on your own hardware, and it's sometimes hard to know how much is enough. That's the question we're going to answer today. At each price point, we'll increase the amount of AI memory available, starting with only 2 GB, until we test out systems with 512 GB of memory and even a full terabyte of RAM. And at each step along the way, I'll show you what's possible with it and how fast it is, or isn't. Now, there's a persistent myth out there that you need gobs of RAM to do anything with AI, but I'm going to start out by showing you that this is not always the case. It's much smarter and more effective to let your actual needs drive your hardware spec. So, instead of buying a GPU and then finding out later what it's possible to do with it, let's just dive right in and see what we can do at the various memory sizes.

Down at the smallest end of the spectrum, you'll find devices intended for edge computing, like the Jetson Orin Nano that I reviewed a few months back. It's a tiny board about the size of a Raspberry Pi, but it features a complete CUDA-capable Nvidia GPU backed with 2 GB of RAM. That's the same amount of RAM you'd find in a GeForce GT 710 card that you can grab today on Amazon for 54 bucks and run CUDA on. So either way, it's a pretty small investment. Now, 2 GB of RAM may not sound like a lot, and it isn't by most measures. But if the model you're trying to run fits into the available memory, that's 90% of the battle. And for simpler tasks, the models can actually be quite small. For example, the deep-Q model that I used in my Tempest Arcade AI project (watch the channel for that one coming soon) has just under 100,000 total parameters, and so it could run in only a few hundred K of memory. That means it fits and runs easily even on the 2 GB Nano. I've also used the Nano for AI vision tasks, such as using the YOLO library to do license plate recognition on my security cameras. Not only did the model fit nicely into memory, but performance was completely acceptable for the task I was performing, as it could still process several frames per second of visual inference. And that's the key: you want the most affordable piece of hardware that has both the capacity to hold the models you're interested in and the performance needed to run them at a rate you can live with. In the Tempest case, I could, at least in theory, do all of my training on the Orin Nano, but it runs at least ten times slower than the RTX 6000 Ada cards in the Threadripper, a setup that we'll check out soon enough. So, in my case, while the Orin Nano had the capacity, it lacked the required performance. In each of our cases, I'll show you what each increasingly large system has the capacity for, show it to you in action, and then you can decide on the price point that meets your needs. Because when it comes to AI, speed costs money, kid. How fast do you want to go?

To run and test each of these models, we'll be using Ollama, an application that allows you to run AI models entirely locally. We'll start by going to ollama.com and clicking on the install link for our operating system. Once the installation has completed, it will add itself to the command line, launch the server in the background, and we'll be able to run our AI.
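For reference, a quick way to confirm that the install took, assuming the stock Ollama command-line client and its default local server port of 11434 (exact output varies by version):

    # confirm the CLI is on the PATH
    ollama --version

    # confirm the background server is answering; it replies "Ollama is running"
    curl http://localhost:11434

    # list any models already downloaded (empty on a fresh install)
    ollama list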
Next, we have to download a model. The Ollama site has a rich repository of models to select from, as well as some surprisingly fast download speeds. We simply head to ollama.com, select a model that will fit into our hardware, and download it. With only 2 GB of RAM to get started with, we'll need to start with one of the smallest models, and it looks like the Gemma 3 model with 1 billion parameters is about 815 MB in size, so it should work nicely for our purposes. To actually install the model, I simply type ollama pull and then the name of the model, including its size after the colon so I'm specific. And so my complete command line is ollama pull gemma3:1b. Depending on your internet connection, it will take anywhere from a few seconds to a couple of minutes to download the model, and once it does, you can run it with the command line ollama run gemma3:1b. I like to add the --verbose flag to my command line to get a sense of the raw performance when evaluating a model, and so I'll do that here as well today. Now, I'm not going to be overly concerned with the actual text quality of the model output, as there are dozens of models to select from, and the one you pick will be chosen first by what fits in memory and second by the capabilities of the model. But those will vary wildly, there's no simple score I can give you for each, and nobody agrees on what those scores should be or how to test them. Plus, the models are constantly changing and evolving. So, my recommendation is to keep an eye on the popularity leaderboard on the Ollama site. Odds are the model you want for general use will be among the more popular.

Once Ollama has loaded and initialized your model, you're ready to go. And speaking of ready to go, my subscriber count is poised to crack 1 million. As you likely know, I'm mostly in this for the subs and likes, so with just a few more subs, I'll be a millionaire. Either way, I'd be honored if you'd take a moment to subscribe to the channel and help me push over that hump.

Back to our model. Let's just ask it to tell us a story. That will give us some sense of how verbose the model is and at what rate it can generate tokens, as long as you remember to specify the --verbose flag. Now, as with all cases here, I'll first let you see the model generating tokens at its actual speed on the very hardware that we're testing on, and then I'll speed up the footage to let it complete so that we can get to the end without waiting around on a geological time scale for some of the more demanding models to finish. And when Gemma 3 1B cranks out our answer, we can see that it did so at a rate of almost 30 tokens per second, which is pretty darn fast, and certainly fast enough to be useful as long as the 1-billion-parameter model can do the work you need. If I showed this to you 5 years ago on 50 bucks' worth of hardware, it would have blown your mind. So, at least keep that in perspective.
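The whole workflow from this first test boils down to a couple of commands. The model tag below is the one listed on ollama.com, and --verbose is the flag that prints the token counts and eval rate quoted throughout the video:

    # fetch the 1-billion-parameter Gemma 3 build (~815 MB per the video)
    ollama pull gemma3:1b

    # chat with it interactively; --verbose appends timing stats after each reply
    ollama run gemma3:1b --verbose
    >>> Tell me a story.

The 'eval rate' line in those stats is the tokens-per-second figure Dave is quoting.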
And next, let's upgrade that perspective a little bit by moving up in RAM capacity. I'm going to jump up to the 8-gigabyte level, as that's a very common capacity in consumer GPUs, something many of you already have on hand for other purposes. To do so, I'm going to turn to the Tesla P40 card in our 45Drives Storinator because, as you'll learn soon enough in a forthcoming episode, as long as you're subscribed, of course, we recently upgraded the Storinator Q30 to an EPYC CPU with a terabyte of conventional RAM. And that's not even considering its 420 TB of disk storage.

The P40 was a mid-level server GPU that fits in a single slot, and it doesn't even have any HDMI outputs, so it's strictly for server use. That does make it handy for AI workloads, and it features 8 GB of video RAM and has CUDA support. Now, the model I wanted to try was the same Gemma 3, but this time in a 12-billion-parameter variant. It was listed as requiring 8.1 GB of GPU space, so I wasn't actually sure whether it would technically fit in 8 GB or not, and the only way to find out was to try. The P40 brought a mix of good news and bad news. The good news is that it was able to squeeze the model into memory, and it ran. The bad news is that it ran really slowly. The P40 just doesn't have the chops to run this model at what I would call live speeds. I want the model to produce text at least as fast as I can read it, or otherwise you wind up waiting on it, and the P40 could barely muster two tokens per second. That started to worry me, because the models we're going to try today are only going to get larger and larger. Once we move to GPUs with more memory, the models will be more complicated, and will the processing horsepower keep up with the memory capacity? Well, again, there's only one way to find out, and that's to bust out the Threadripper.

You see, Dell had graciously loaned me a really nice workstation for a while. So nice, in fact, that it had the top-of-the-line 96-core Threadripper and not one but two RTX 6000 Ada GPUs, each sporting 48 GB of GPU memory, for a combined total of 96 GB of GPU memory. But at some point it had to go back, and it did so early this year. Then I started working on the Tempest AI project, which we'll feature in upcoming episodes, and I needed something with more inference horsepower. I mentioned this to Dell and they sent the machine back to me. So I let it take a break from Tempest training long enough to do some large language model work. Even though this machine technically has 96 GB of GPU memory, you're not going to be able to just naively load a 96 GB monolithic model. That's because the memory is spread across those two GPUs, and by default you'll only be able to use a single unit. Now, you could split the model into layers, perhaps, and run each layer on a different GPU, but that's a lot of work, and we want to fairly compare how things work out of the box. And so, for that reason, we're constrained to using a single GPU only. Now, depending on who you ask, the RTX 6000 Ada units are somewhere between the 4090 and the 5090 in performance, probably closer to the 4090. Their main claim to fame in the AI world, however, is the much larger memory capacity than you'd find on either of those consumer GPUs.

And so I went looking for the biggest model I could fit into 48 gigabytes, and I found it in the DeepSeek R1 70-billion-parameter variant. Weighing in at 43 GB, it's a substantial download. Now, I'm fortunate to be on 5-gigabit fiber, so it goes pretty quickly, which is a testament to the bandwidth that the Ollama model hosting must have. It's like Steam in its ability to saturate your download pipe. Once the download is complete, it runs through an MD5 hash check to make sure the model is intact and unmodified, and then it will launch. The first load takes some time, especially on these larger models. After all, loading 40 GB from a fast SSD is still going to take 10 seconds or so no matter what, so a bit of waiting is inevitable. Once the model loaded, however, I was very pleasantly surprised by the performance. It produced results at what I'd call a fast reading pace, about 20 tokens per second. That's enough to keep me busy as it's producing its answer, and it's fast enough to be hooked up to something like Visual Studio for local AI-assisted development.
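Whether a model genuinely fit into video memory or partially spilled over into system RAM is something Ollama will report itself. A sketch, assuming a reasonably recent build in which ollama ps lists the GPU/CPU split for whatever is currently loaded:

    # pull the larger variants referenced in this section
    ollama pull gemma3:12b         # ~8.1 GB, the tight squeeze on the 8 GB card
    ollama pull deepseek-r1:70b    # ~43 GB, for the 48 GB RTX 6000 Ada

    # while a model is loaded, show where it ended up
    ollama ps
    # the PROCESSOR column reads "100% GPU" when it fits entirely in VRAM,
    # or something like "25%/75% CPU/GPU" when part of it spilled to system RAM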
Now, stepping up to the next level at 128 GB meant that unless somebody's going to loan me an Nvidia B200, we're going to have to leave dedicated GPUs behind and move up to unified memory of the kind you'd find in a modern Apple machine or a Windows Ryzen desktop with integrated graphics. And since I'm fortunate to have a 128 GB M2 Mac Pro on hand as my main machine, it made perfect sense to run that same model, DeepSeek R1 70B, on the Mac to compare against how it tests on the Nvidia. And so that's precisely what I did. Downloading and installing the model is the same, of course, and then I just asked the model the same prompt: tell me a story. As soon as I did, I was greeted with an experience very similar to what I'd seen on the Nvidia RTX 6000. The text was flowing at a comfortable reading rate, though perhaps not quite as fast as the prior test; I wasn't sure yet. But when I saw the numbers, they fell into line with my expectations: 12 tokens per second. That's still fast enough for live use and production coding, but slower than what we saw with the Nvidia GPUs.

But to go even larger, we'd need bigger hardware. Hardware I don't own. Hardware that nobody has offered to loan me, or it would have been in this episode. But then I got a serendipitous email from a fellow named Riff. He said that he was a developer who works in fintech and a Google Developer Expert, and that he had access to a 512 GB Mac M4. He was willing to let me use the machine remotely, and so we set up Tailscale, and soon enough I was able to use Apple screen sharing to connect directly to the desktop on the other side of the world. Now, I figured Riff is a Google evangelist, so he'd probably like it if I started with a Google model. So I picked the largest of the Gemma 3 models, coming in at 27 billion parameters. At 17 GB, it's too large to run on all but the largest consumer GPUs, but it should run easily on the Mac's unified memory architecture. So, I gave it a quick download and test using the same prompt as always. And I'd say that the 27-billion-parameter Gemma 3 on the Mac is about the sweet spot, as it's a very capable model and was able to generate more than 23 tokens per second, more than enough for live desktop use. It's also rather unique in that it supports a massive context window of up to 128,000 tokens. So, our 'tell me a story' prompt certainly doesn't even scratch the surface, but you could upload a huge amount of RAG context with your query.

But we were on a big Mac for one primary reason: to run the largest model we could find. And for the last several months, that has remained the original DeepSeek R1, coming in at 671 billion parameters and requiring a whopping 404 GB of GPU-accessible memory just to load it. But on paper, we had the hardware for it, so I spun it up and gave it a shot. Now, as a little side quest, the 45Drives Storinator has a full terabyte of RAM, or 1,024 GB. It's not GPU-accessible memory, but you can run Ollama on just the CPU if the model won't fit into GPU memory. So, I figured I'd give that a try as well. I was able to load the massive model with ease, but its performance was, uh, wanting. It was about three tokens per second, sometimes two, and this is on the EPYC CPU. Too slow to do anything live with. But now, we could bring the GPU to the table. And doing so brought us firmly into the realm of things like a dog playing the piano: something where it's impressive that it can do it at all, regardless of how well it actually does it. And while it's impressive that the 512 GB Mac M4 can run the model at all, it can still only do so at about six tokens per second. Now granted, that's more than twice as fast as the EPYC CPU, but I was hoping for a little more.
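To repeat the same comparison across machines without the interactive prompt, Ollama's local HTTP API exposes the raw counters behind those tokens-per-second figures. A minimal sketch, assuming the default port and a non-streaming request:

    # request a single complete response, including the timing fields
    curl http://localhost:11434/api/generate -d '{
      "model": "deepseek-r1:70b",
      "prompt": "Tell me a story.",
      "stream": false
    }'
    # the JSON reply includes eval_count (tokens generated) and eval_duration
    # (nanoseconds), so tokens per second = eval_count / (eval_duration / 1e9)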
It looks like live performance with the largest of the models requires hardware akin to that B200 or the forthcoming DGX Station, neither of which I have, but you never know, maybe one day. If you enjoyed today's look at the very large language models, please consider subscribing to the channel for more like it. And if you could drop a like on the video to make the algorithm happy, I'd appreciate it. I'm always eager to hear your comments and questions, and every Friday on Dave's Attic, we go through the best of them on Shop Talk. I'll put a link in the video description for you and encourage you to check it out. Thanks for joining me out here in the shop today. In the meantime, and in between time, I hope to see you next time right here in Dave's Garage. Do it, Lyn. Do it. Do it.

Video description

Dave compares multiple desktop AI models running at various hardware price points from 2GB up to 1TB of memory and reveals the performance of each.
Free Sample of my Book on the Spectrum: https://amzn.to/3zBinWM
Check out ShopTalk, our weekly podcast, every Friday! https://www.youtube.com/@davepl
Follow me on X: https://x.com/davepl1968
Facebook: https://www.facebook.com/davepl
