bouncer

Alex Ziskind · 165.9K views · 3.5K likes

Analysis Summary

40% Low Influence

“Be aware that the performance 'bottlenecks' on NVIDIA are specifically framed around quantization levels that just barely exceed consumer VRAM limits to make the Apple Silicon alternative appear uniquely capable.”

Ask yourself: “Did I notice what this video wanted from me, and did I decide freely to say yes?”

Transparency: Mostly Transparent
Primary technique

Anchoring

Presenting an extreme number or claim first so everything after seems reasonable by comparison. The first piece of information becomes your reference point — even when it's arbitrary or deliberately inflated. Works even when you know the anchor is irrelevant.

Tversky & Kahneman's anchoring heuristic (1974)

Human Detected
95%

Signals

The video features a known tech personality (Alex Ziskind) using natural, unscripted speech patterns and demonstrating physical hardware. The presence of spontaneous reactions and specific technical troubleshooting indicates a human-led production.

  • Natural Speech Patterns: The transcript contains natural filler phrases, self-corrections ('Hold on, hold on before you misunderstand'), and conversational asides ('Why couldn't they make it just a little bit smaller, you know').
  • Physical Interaction and Context: The narrator refers to physical hardware in their immediate environment ('this little box right here', 'the 5060 that I have here') and performs real-time testing.
  • Personal Voice and Expertise: The content reflects a specific creator's workflow and personal hardware collection, rather than a generic script-farm style.

Worth Noting

Positive elements

  • This video provides specific, data-driven benchmarks for the Qwen 3 Coder model across different hardware configurations, which is highly useful for developers deciding on local AI setups.

Be Aware

Cautionary elements

  • The technical benchmarks are curated to highlight a specific 'VRAM wall' on NVIDIA consumer cards, making the M4 Mac Mini appear a more 'logical' financial choice than it might be for users willing to accept a more aggressive quantization (e.g., Q4 instead of Q8).

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed: March 14, 2026 at 18:23 UTC · Model: google/gemini-3-flash-preview-20251217 · Prompt Pack: bouncer_influence_analyzer 2026-03-13a · App Version: 0.1.0
Transcript

There's a brand new totally free open-source LLM that's built for coders and I've been waiting for this one. Qwen 3 Coder, but it's 480 billion parameters and you're going to need a big chunker to run this 510 GB model. Hold on, hold on before you misunderstand. I mean, you're going to need a lot of VRAM to be able to run it. Either unified memory on Apple Silicon or a bunch of discrete GPUs altogether, aka a data center. But they were nice enough to also release Qwen 3 Coder 30 billion Flash. And this is the one that we'll be able to run at home. But for those with even the top-of-the-line consumer GPU, the RTX 5090, there's a slight issue. The 5090 has 32 GB of VRAM. 32. And this model in Q8 is 32.48 GB. Just barely, barely not fitting in there. Oh, if you don't know what I'm talking about, Q8 is the quantization level. Quantization is how much the model has shrunk down in order to fit in video cards like this. So, this one is not going to fit by half a gigabyte. We're going to test that out right now. And Q4, which means it's shrunk down even more, is 18.63 GB, which is just outside the range of the 16 GB that the 5060s, 5070s, and 5080s have. So, you might need to drop down to the previous generation 4090, which has 24 gigs of VRAM, in order to be able to run the 4-bit quantized version. But believe it or not, this little box right here has 64 gigs of unified memory and it can run both of those quantization levels. Three of these will fit into my 5090. And of course, the M3 Ultra will be able to do that as well. Here's Qwen 2.5 14 billion, which is the previous version, which is 14.62 GB. And we're going to run this on the 5060 that I have here. Offloading all the layers. Let's go. For some reason, this is still not fitting quite nicely, and we're seeing a lot of CPU action happening here. 13 GB out of the 16 available are being used on that GPU, and we're only at 43% utilization, which is not really great. We're getting 11 tokens per second there. However, the new model, Qwen 3, has better utilization, more than two times the number of parameters. It's not that much bigger. And check this out, it's faster. Even though we're not using the GPU to its fullest, and we're not fully loading that model onto the 5060 because it only has 16 GB, we're still using the CPU. But look at this. We're getting 30 tokens per second there. That means this model is way more efficient. You're going to be able to run more parameters faster across all the hardware. Even if you have to use some of the CPU, you want to have your model fully in the GPU. So, it's the fastest. You don't want any CPU spillover because then it's going to slow things down. And we're about to see that. So, now I'm going to switch over to the 5090, which has 32 gigs of VRAM. And I'm going to load up the Q8 version of this model. And that's just a little bit too big to fit in 32 gigs. Why couldn't they make it just a little bit smaller, you know, instead of a little bit bigger so that it fits completely in 32 gigs? It'll be nice and neat. But no, it's still going pretty decently fast. 24% utilization. We're loading only 26 out of 32 GB of VRAM. The rest is being offloaded to the CPU, unfortunately, giving us 31 tokens per second. So these days, I'm constantly flipping between models. GPT-4 for notes and email, Claude for code refactors, Flux for image generation, Kling for video, four tabs, four bills, and counting. Enter ChatLLM Teams.
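As a rough sanity check on those sizes: a quantized model's file is roughly parameters × bits-per-weight / 8, plus overhead for metadata and layers kept at higher precision. A minimal sketch with illustrative numbers, not the exact file sizes from the video:

    # Back-of-the-envelope check of whether a quantized model fits in VRAM.
    # Illustrative sketch; real GGUF/MLX files carry extra metadata, and
    # inference also needs memory for the KV cache and activations.

    def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
        """Approximate on-disk size of a quantized model, in gigabytes."""
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    print(model_size_gb(30.5, 8.0))   # ~30.5 GB of raw weights; the quoted Q8 file is 32.48 GB
    print(model_size_gb(30.5, 4.5))   # Q4_K-style quants average roughly 4.5 bits per weight
    print(model_size_gb(30.5, 8.0) <= 32.0)  # raw weights fit in 32 GB; the real file misses by ~0.5 GB

The overhead beyond the raw weights is why the Q8 file lands just above a 5090's 32 GB rather than just under it.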
There's one dashboard that houses every top LLM and RouteLLM picks the right one for you for a given task. o4 Mini High for fast answers. Claude Sonnet 3.7 for coding. Gemini 2.5 Pro for big context. And even adds GPT 4.1 before ChatGPT has it. Chat with PDFs and PowerPoints. Then generate decks and docs and do deep research all in the same chat. Need human-sounding copy? The humanize toggle rewrites text to beat AI detectors. Spin up agents and run code with AI Engineer. I built my first bot in just minutes. Track artifacts. Create GitHub pull requests and debug from the same interface. Need visuals? No problem. Use Flux or Ideogram and Recraft for images. Kling, Luma, and Runway for video, all built-in. And the kicker is just $10 a month, less than one premium model. Head over to chatllm.abacus.ai or click the link in the description and level up with ChatLLM Teams. What about this little guy? This is the M4 Pro Mac Mini and it's got 64 gigs of unified memory. Most of that can be used by the GPU. All right, here on top I've got the Mac Mini. On the bottom I've got the Windows machine. Let's load up that 4-bit model. By the way, this is the MLX quantization because MLX is a little bit more performant on Apple Silicon since it's designed for it. Boom. And there it goes. That is going pretty fast. Does that look faster than the other ones? I don't know. Let's find out. I'm going to stop it. 75 tokens per second, which is faster than the 5060 for sure. But it's also the 4-bit quant that's faster than the 8-bit quant running on the RTX 5090. I know it's a little confusing. I've got charts coming up, don't worry. Here's the 8-bit version. Let's try that on the Mac Mini. So, this one, of course, should be a little bit slower. And it is. 52 tokens per second. 52 tokens per second is a lot faster than 31 tokens per second. So if you're dealing with an 8-bit quant of this model, this runs way faster on an Apple Silicon Mac Mini than it does on the RTX 5090 because of that CPU spillover when you're running it on the 5090. That's crazy. Of course, I've got the M4 Max and the M3 Ultra. Both of those are going to be even faster than that. But what about Windows and Linux people? What can they do to improve their lives here since they were hurt so badly by these model sizes? Well, they can always get two. So, let's say you have like a 5080 or a 5060 or 5070 and those have 16 GB each. You can add another GPU and extend that. But in order to be able to run the full 30 billion parameter model at 8 bit, you're going to need that 5090 with 32 gigs and another card. So, I've got one of the lower ones here. I've got a 5060 Ti with 16 gigs. Together, they have 48 GB. First of all, you're going to need to get a motherboard that can support two GPUs. And then you're going to need to run software that can support that. LM Studio is able to do that. And I can control what goes where through the hardware tab. I can turn on and off each one of these GPUs. I can have them both on. Now, remember that 31 tokens per second, that's the number to beat. And I'll select that 8-bit quantization. 8 bits is going to supposedly give us better results than 4 bits. However, the quality testing is something to be done in a separate video. This video is just for speed testing only. And let's go. Okay, it's looking pretty decent. I like the speed. Both the GPUs are working. It's getting warmer. Let's take a look at what speed we're getting here. We're getting 50.
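For reference, running one of those MLX quantizations on Apple Silicon takes only a few lines with the mlx-lm package. A minimal sketch, assuming a hypothetical 4-bit MLX conversion of the model (the repo id below is a placeholder, not necessarily the build used in the video):

    # Minimal sketch of running an MLX-quantized model on Apple Silicon with
    # mlx-lm (pip install mlx-lm). The model id is a placeholder assumption.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen3-Coder-30B-4bit")  # hypothetical repo id
    response = generate(
        model,
        tokenizer,
        prompt="Write a Java function to find prime numbers up to n.",
        max_tokens=512,
        verbose=True,  # prints generation speed in tokens per second
    )

With verbose=True, mlx-lm reports tokens per second directly, which is the same number being compared across machines here.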
So, we're getting closer to what that Mac Mini can do, but we're still under it in tokens per second even using both of these GPUs. There is one more setting we could change. There is this 'split evenly' setting. And what that does is, let me show you. Here's the two GPUs right there. It's trying to balance out the amount of VRAM it's using from each GPU. It's using 21 out of 32 GB from the 5090. 32% utilization. And it's using 10 out of 16 GB on the 5060 with 32% utilization as well. So what's happening there is it's splitting the model. But we know that the 5090 has a lot more memory bandwidth than the 5060. So what we really want to do is load up that 5090 to the max before we spill over to the 5060. And you can do that in LM Studio. And we've got both the GPUs selected, but instead of split evenly, we want to do priority order one and two. And we want to make sure that the 5090 is listed first. You can move these around, but the 5090 should be first because it has more bandwidth and it has more memory. So, let's use that to the max. Write a Java function to find prime numbers up to n. Boom. There we go. But is this fast enough? Let's stop that. 49 tokens per second. That's not great. For some reason, even though there's enough RAM here, enough VRAM, since we have to do a little bit of copying back and forth, that's taking a hit on the performance. Now, that's the 8-bit version. Let's take a look at the 4-bit version. This should be a lot faster. And yeah, wow, look at that difference. We're seeing a lot of utilization on that 5090, almost no utilization on the 5060. Everything is pretty much running and fitting inside that 5090. So, there's no need to share the GPUs at all. So, what we're seeing here is really from the 5090. 157 tokens per second on this one. That is a huge difference. Now, a lot of you are interested in not just a little short prompt. So, I had to automate this whole process because it was going to take forever. It already took forever because some of these scripts that I wrote I had to run overnight. I ran them over and over and over again and I wanted to shove in a bunch of different prompts. So here are some examples. Extra long programming code-heavy prompt, which is 17,000 tokens. This would be an example of a really large context window. And I had to enlarge the context window of this model to 50,000. This model supports a huge context window. But of course, when you do that, you're increasing the memory footprint that's required as well. And you're going to see a little bit of a slowdown from what you've seen in the UI, what I just demonstrated. This one is 17,000. Here is one using Repo Prompt, which is a tool I've shown before on the channel. Really nice tool. Check it out. This allows you to basically create a mapping of your entire project or parts of the project and include that as part of the context. So, this one is 44,000 tokens. And then we got some smaller token size prompts like long architecture enterprise prompt, non-programming analysis prompt. These are long prompts. These are over a thousand tokens each. And then I did some medium prompts, which are prompts that you would probably enter yourself, in the range of 100 tokens or so. And then we got some really short prompts like design a microservices architecture. Short debug prompt. Fix and debug this little function right here. Simple greeting. Hi. By the way, I have to apologize. I've been using a lot of hi in my previous videos. And after doing this analysis, it's not a good prompt at all.
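The memory hit from a 50,000-token context comes mostly from the KV cache, which grows linearly with context length. A rough estimate under assumed architecture numbers (the layer and head counts below are illustrative defaults, not the actual Qwen 3 Coder configuration):

    # Rough KV-cache size estimate; the architecture numbers are assumptions.
    def kv_cache_gb(context_len: int, layers: int = 48, kv_heads: int = 4,
                    head_dim: int = 128, bytes_per_value: int = 2) -> float:
        """Memory for keys and values across all layers, in gigabytes."""
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
        return context_len * per_token / 1e9

    print(kv_cache_gb(50_000))  # ~4.9 GB on top of the weights at 16-bit cache precision

Those extra few gigabytes are why enlarging the context window can push a model that previously fit in VRAM back into CPU spillover.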
And I'll show you why. Sorry to all those folks that had to deal with me saying hi. I won't do it again. Well, maybe I will. I don't know. Anyway, I shoved all these prompts in one at a time, five times each. Here's what one of the results looks like. The long architecture prompt is on the left. The shortest ones are on the right. This little line right here, that's the variability. You want that to be as small as possible. So even after running that 'hi' prompt, the short simple greeting prompt, five times, no matter how many times I ran it actually on all the different platforms, different models, I get a really huge variability. So it's not a very good one. These were run not through the UI, but through its REST endpoint. LM Studio exposes a REST endpoint where you can query it using an OpenAI-style API, which is most likely how you'd be using it if you were to use this in a code editor, for example, or with something like Qwen Code, which is a terminal-based, agent-based application. So this is what 8-bit looks like using this setup with the RTX 5090 and 5060. There is a little bit of a difference between the long prompts and the medium prompts. Notice that the very short prompts have really poor performance speed-wise. The fastest one was this short programming prompt. And the system completely failed when it came to the 17,000 token prompt and the 44,000 token prompt. Now, let's take a look at this. This is the 4-bit version here. We're running pretty much on the 5090 even though both cards were involved in this. And we're getting really good speeds of 163 tokens per second on all these. We're doing really well. Here's the M4 Pro, this little Mac Mini here. And we're doing pretty well here. Over 80 tokens per second on some of these. And here are the extra long prompts. These are the prompts that the RTX cards weren't even able to process, but this little box could. Not very fast. 10 tokens per second there. That's the 44,000 token prompt. It still did it, which is pretty cool. And then 25 tokens per second for the 17,000 token prompt. M4 Max, even faster. 25 tokens per second for the 44,000 token prompt and up to 110 tokens per second for the medium programming prompt. But you know what surprised me the most? The M3 Ultra didn't do as well as the M4 Max. Here's the 4-bit result with all of them combined. Of course, with four bits, the RTX 5090 just destroys it. So, if you're okay with that quant, that's the fastest result you're going to get. M4 Pro, M4 Max, and M3 Ultra are all down here. And let's take the fastest result, which is that M4 Max. And compared to the M3 Ultra, we're quite a bit off here. The M3 Ultra is consistently slower for this particular model than the M4 Max, which is kind of odd. Except for these really long prompts, the M3 Ultra is a little bit faster there, which is kind of interesting. But the M4 Pro and the M4 Max relation is kind of what I would expect. Now, if we take a look at the 8-bit version of this, whoa. Still kind of the same with the M4 Max and the M3 Ultra. The M3 Ultra actually does better with 8-bit, and the M4 Pro seems to do a little bit worse. But what's really worse is that RTX-based machine, the Nvidia machines. First of all, it didn't run the extra long prompts at all. And second, it took a huge hit from having to share that memory and go through system memory while it was copying back and forth and combining the results from the two cards. This probably would have been better on that RTX Pro 6000. That is a very, very expensive card.
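A benchmark harness like the one described would look roughly like this: hit LM Studio's OpenAI-compatible endpoint (it serves on http://localhost:1234/v1 by default), run each prompt several times, and compute the mean and spread. A minimal sketch; the model id and prompt set are placeholders, not the exact ones from the video:

    # Minimal benchmark sketch against LM Studio's OpenAI-style REST endpoint.
    # Assumes the local server is running; model id and prompts are placeholders.
    import statistics
    import time
    import requests

    URL = "http://localhost:1234/v1/chat/completions"
    PROMPTS = {
        "simple_greeting": "Hi",
        "short_programming": "Write a Java function to find prime numbers up to n.",
    }

    def tokens_per_second(prompt: str) -> float:
        """Send one prompt and return completion tokens divided by wall time."""
        start = time.time()
        r = requests.post(URL, json={
            "model": "qwen3-coder-30b",  # placeholder model id
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512,
        })
        r.raise_for_status()
        elapsed = time.time() - start
        return r.json()["usage"]["completion_tokens"] / elapsed

    for name, prompt in PROMPTS.items():
        runs = [tokens_per_second(prompt) for _ in range(5)]  # five runs each, as described
        print(name, round(statistics.mean(runs), 1), "+/-", round(statistics.stdev(runs), 1))

The standard deviation across the five runs is the 'variability' line in the charts; a prompt like 'Hi' produces so few output tokens that timing noise dominates, which is why it turned out to be a poor benchmark prompt.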
And if you want to know more about that card, check out this video over here. Thanks for watching. I'll see you next time.

Video description

Here's why “free” QWEN3 coder can end up costing NVIDIA users twice as much while Apple owners skate free. Check out ChatLLM: https://chatllm.abacus.ai/ltf 🛒 Gear Links 🛒 * 📦🎮 Mini Rack: https://amzn.to/4dXfwan * 💻🔄 The GmkTec EVO X2: https://amzn.to/4l5BHOh or if sold out try directly from them: https://www.gmktec.com/products/amd-ryzen%E2%84%A2-ai-max-395-evo-x2-ai-mini-pc * 🍏💥 M4 Mac Mini Deal: https://amzn.to/3ZVDfly * 🍏💥 M4 Pro Mac Mini Deal: https://amzn.to/3ZVDfly * 🎧⚡ Great 40Gbps T4 enclosure: https://amzn.to/3JNwBGW * 🛠️🚀 My nvme ssd: https://amzn.to/3YLEySo * 📦🎮 My gear: https://www.amazon.com/shop/alexziskind 🎥 Related Videos 🎥 🧬🐍 Mac Studio CLUSTER vs M3 Ultra 🤯 - https://youtu.be/d8yS-2OyJhw 🧳🧰 Mini PC portable setup - https://youtu.be/4RYmsrarOSw 🍎💻 Dev setup on Mac - https://youtu.be/KiKUN4i1SeU 💸🧠 Cheap mini runs a 70B LLM 🤯 - https://youtu.be/xyKEQjUzfAk 🧪🔥 RAM torture test on Mac - https://youtu.be/l3zIwPgan7M 🍏⚡ FREE Local LLMs on Apple Silicon | FAST! - https://youtu.be/bp2eev21Qfo 🧠📉 REALITY vs Apple’s Memory Claims | vs RTX4090m - https://youtu.be/fdvzQAWXU7A ⚡💥 Thunderbolt 5 BREAKS Apple’s Upcharge - https://youtu.be/nHqrvxcRc7o 🧠🚀 INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k 🧱🖥️ Mac Mini Cluster - https://youtu.be/GBR6pHZ68Ho * 🛠️ Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX — — — — — — — — — ❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺 Click here to subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1 Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join — — — — — — — — — 📱LET'S CONNECT ON SOCIAL MEDIA ALEX ON TWITTER: https://twitter.com/digitalix — — — — — — — — — #m4pro #rtx5090 #llm
