Analysis Summary
Performed authenticity
The deliberate construction of "realness" — confessional tone, casual filming, strategic vulnerability — designed to lower your guard. When someone appears unpolished and honest, you evaluate their claims less critically. The spontaneity is rehearsed.
Goffman's dramaturgy (1959); Audrezet et al. (2020) on performed authenticity
Worth Noting
Positive elements
- This video provides highly specific, empirical data on how AMD's new integrated GPUs handle large-scale LLMs (up to 235B parameters), which is rare in mainstream tech reviews.
Be Aware
Cautionary elements
- The use of 'revelation framing' regarding CPU benchmarks (referencing DHH's tweets) creates a sense of 'insider knowledge' that may lead viewers to overlook the stability issues mentioned later in the video.
About this analysis
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Transcript
The Framework Desktop is the newest machine to land with AMD's Ryzen AI Max+ 395 chip with 128 gigs of RAM, which is fantastic for local AI like large language models and image and video generation. And while they're a little late to the party, not the first, not the second, not even the third, they might have just nailed the best cooling solution yet. I've tested the ASUS Flow Z13, impressive for cramming this chip into a laptop, and the GMKtec EVO-X2, a tiny powerhouse that gets a little noisy, but capable. But the Framework board, wow, whisper quiet. Think Mac Studio level quiet. Speaking of Mac Studios, I've been testing some clusters on this channel. I did a Mac Studio video. I did a Mac Mini cluster video. You might have seen those. I thought it would be fun to stack four Framework Desktop boards, just like the ones they showed off earlier this year when they announced the Framework Desktop. And while this is great, I think we can improve a few things. [Music]

And we're ready for our first race here. I've loaded up the Qwen 3 Coder 30 billion parameter model at Q8. Q8 is the quantization: the model's weights are quantized down to 8 bits from the original so that it can fit on smaller hardware like we have here, and this model is about 30 gigs in size. This is a medium-size programming architecture prompt: design a scalable web application architecture for an e-commerce platform. Let's go. And off they go. This is a pretty long response.

Now, why am I racing the same LLM on four identical Framework Desktop boards? Because memory is king for local AI, and this AMD chip lets me flip between two GPU memory modes in the BIOS. There's dynamic, where the CPU and GPU borrow RAM on the fly, or fixed, where the GPU gets a permanent chunk. This is unlike Apple's always-shared, or unified, memory architecture. This split could mean the difference between smooth sustained performance and unexpected slowdowns, and this is something that's often confusing with the new RAM settings that I wanted to clear up right away. We got slightly different results, but within the margin of error here: 37 tokens per second, 42, 43, and 37.

Now, I've tested different settings before when I was testing this machine, and I got different results. In that video, I concluded that the auto setting in Windows is not the best option, but here we're on Linux. I did some testing using a bunch of different prompts, from 17,000 tokens to 45,000 tokens all the way down to 100 tokens, and these range from code-heavy programming prompts to architecture prompts, non-programming explanations, and so on. Even a simple "hello" greeting is in there as well. I wrote this test and automated the process of running these many, many times over, so we get some statistically significant data. By the way, here this machine is set to 64 GB fixed, this one to 16 GB fixed, this one to 96 GB fixed, and this one is dynamic, set to 512 megabytes. You can tell the difference in the kind of prompt you issue and the size of the prompt, too. It also matters what kind of model you're running, so there's a lot of variation. For this particular model, Qwen 3 Coder 30B with Q8 quantization, the dynamic setting in the BIOS is actually faster for a lot of these. For example, on this medium architecture prompt we got 42 tokens per second for the dynamic setting, and the next one down was 41 for the 96 GB fixed setting. The worst one was 64 GB.
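(For reference, the automated harness described above can be as simple as a loop against a local server. Here's a minimal sketch, assuming an OpenAI-compatible endpoint such as the one LM Studio or llama.cpp's llama-server exposes; the URL, model id, and repeat count are placeholder assumptions, and the actual prompts are linked in the video description.)

```python
# Minimal sketch of an automated tokens-per-second benchmark.
# Assumes an OpenAI-compatible local server (e.g. LM Studio or
# llama.cpp's llama-server); URL and model id are placeholders.
import time
import statistics
import requests

URL = "http://localhost:1234/v1/chat/completions"  # assumed endpoint
MODEL = "qwen3-coder-30b"                          # placeholder model id

def run_once(prompt: str) -> float:
    """Send one prompt and return end-to-end generation speed in tok/s."""
    start = time.time()
    resp = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }, timeout=3600)
    elapsed = time.time() - start
    # Rough figure: includes prompt processing time in the denominator.
    return resp.json()["usage"]["completion_tokens"] / elapsed

prompts = {
    "short": "hello",
    "medium-architecture": "Design a scalable web application "
                           "architecture for an ecommerce platform.",
}

# Repeat each prompt several times so one noisy run doesn't dominate.
for name, prompt in prompts.items():
    speeds = [run_once(prompt) for _ in range(5)]
    print(f"{name}: {statistics.mean(speeds):.1f} ± "
          f"{statistics.stdev(speeds):.1f} tok/s")
```

Repeating each prompt and averaging is what produces the "statistically significant data" the video refers to.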
You can see that yellow bar sticking up on most of these, except when you get to really large prompts right down here. The 96 GB setting wins out there: 19 tokens per second for the 17,000 token prompt, and for the 45,000 token prompt, we got 10 tokens per second. I just want to pause right here and say this is running on the iGPU, the integrated GPU. It's not one of these, folks. It's inside. Now, we've seen this, of course, from Apple, but we haven't seen such performance on an iGPU until this chip.

And since we're talking about memory, bandwidth matters just as much as capacity. This is how the chip's bandwidth stacks up against the competition. Here's the Ryzen AI Max+ 395 at 256 GB per second. Not bad for an iGPU. We've got the Apple M4, which is lower; the Apple M4 Pro, which is higher; and then, of course, the M4 Max and M3 Ultra have pretty high memory bandwidth, and it shows. I'm going to show you some charts in a bit. Then we have this one, the Nvidia DGX Spark. Still waiting for that one. Where is it? But also keep in mind that this is the only machine on this list that's x86, so it's going to run all your Windows and all your Linux AI stuff. In other words, it's going to be super compatible with pretty much everything, going forward and legacy.

Speaking of Apple's M4 Pro, I ran the same model with INT8 quantization, and yeah, it is a bit faster than this desktop. For example, on this short programming prompt: 47 tokens per second for the Framework and 55 tokens per second for the M4 Pro. But that's not the case for all these prompts; on the really short prompts, the Framework Desktop wins. But of course, who's happy with the really short prompts? For the extra-long prompt, the Framework Desktop won again. It beat out the M4 Pro, which is no small feat. Another advantage of the Framework Desktop is 128 gigs of RAM, whereas the M4 Pro is capped at 64 GB, and that's only on the Mac Mini. Very expensive little Mac Mini. Now, if you want to go to 128 gigabytes in the Mac world, then you have to get either a MacBook Pro with 128 gigs and the M4 Max, or the M4 Max Mac Studio. That's not this one, but it looks just like it. Okay, it's a prop. Nope, not the $2,000 one, because that's limited to 36 gigs of memory. Oh yes, Apple will get you. You've got to go to this one, and then you have the 128 GB option for $1,000 more, and then you're at 3,500 bucks. But wait, let's also get the same terabyte drive that I have in those Framework Desktops. Now you're at $3,699. But to finish that thought, here's the comparison. Yeah, the M4 Max is significantly faster due to the memory capacity and the memory bus width, which translates to memory bandwidth. It beats out the Framework Desktop in all the tests.

So these days, I'm constantly flipping between models. GPT-4o for notes and email, Claude for code refactors, Flux for image generation, Kling for video. Four tabs, four bills, and counting. Enter ChatLLM Teams. There's one dashboard that houses every top LLM, and RouteLLM picks the right one for a given task: o4-mini-high for fast answers, Claude Sonnet 3.7 for coding, Gemini 2.5 Pro for big context, and it even adds GPT-4.1 before ChatGPT has it. Chat with PDFs and PowerPoints, then generate decks and docs and do deep research, all in the same chat. Need human-sounding copy? The humanize toggle rewrites text to beat AI detectors. Spin up agents and run code with AI Engineer. I built my first bot in just minutes. Track artifacts, create GitHub pull requests, and debug from the same interface. Need visuals? No problem.
Use Flux or Ideogram and Recraft for images; Kling, Luma, and Runway for video, all built in. And the kicker: it's just $10 a month, less than one premium model. Head over to chatllm.abacus.ai or click the link in the description and level up with ChatLLM Teams.

So, there's the GPU part of the story, which we'll get back to in a moment. But there's also that CPU: 16 cores, 32 threads. Let me ask you a question. Can you hear anything right now? There are four of them running all on top of each other in this rack. I'll tell you what I can hear: I can hear that switch. Yep. But I can't hear any of those Framework Desktop boards, and this is what shocked me the most. All right, all right, I'm getting back into that. I can't get distracted. Let's talk about that CPU, which is a powerhouse.

Just a little detour, okay? Because I just saw this tweet by DHH. He says the Framework Desktop AMD 395 is putting out a measly 17,000 multi-core Geekbench score in the Ars Technica review, but he got a 3,100 single-core score and a 25,474 multi-core score. So I was like, what? How is that even possible? I mean, it's possible, obviously, but I had to dig things up. So, here is the Flow Z13 score: 20,000. The M4 chip has a multi-core score of 15,000. The M4 Pro: 20,000. Okay, it's about there. Here's the M4 Max: 25,900. That's crazy, that the CPU score is about the same as the M4 Max. Now, I ran this myself, of course. I got just over 23,000, but this was on Fedora and DHH was using Arch Linux. Maybe that has something to do with it. I don't know. But that's still pretty high, and I got that score on both the Framework Desktop and on the EVO-X2. That chip's good.

As a side note, this is my first time using Fedora, and I've got to say, I'm really liking it more than Ubuntu. Stuff just works well, mostly. The ROCm install on bare metal was still rocky, which is why I ended up using Donato Capitella's toolbox images to quickly swap between Vulkan, ROCm, and even different ROCm versions without reinstalling the whole stack. The other way you can try out ROCm and Vulkan is in LM Studio. If you don't know what I'm talking about: you want your LLMs to run fast, you want them to run on the GPU instead of the CPU, and for that you need an API that does the parallel programming on the GPU. There are many different approaches to it; the two most popular for AMD are ROCm and Vulkan. LM Studio detects this and installs these automatically, which is kind of cool. So you can run your LLMs on ROCm, you can run them on the CPU, or you can run them on Vulkan, and they all use llama.cpp as the backend, which is a popular backend for inference. Ollama also uses it, but Ollama at this time doesn't really use Vulkan or ROCm. I couldn't get it to work, so it uses the CPU, which in turn becomes really slow.

Now, I have Vulkan selected for LM Studio because I found that it is the fastest. In fact, there are certain models that wouldn't even load under ROCm at all in LM Studio. Take the really huge Qwen 3 235 billion parameter model, which is 103 GB on disk. Believe it or not, that loads on one of these. I mean, why would you not believe me? I'm doing it right now. And there it is. It's loaded. I can start querying it. Boom. Not bad, right? Well, it runs under Vulkan, but it does not run under ROCm. I'm getting nine tokens per second there.
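(The toolbox workflow mentioned above makes backend comparisons scriptable. A hedged sketch follows; the container names are assumptions standing in for Donato Capitella's actual image names, and the model path is a placeholder.)

```python
# Sketch: run the same llama-bench comparison across backend toolboxes.
# Container names and the model path are placeholders; the real toolbox
# image names may differ.
import subprocess

MODEL = "/models/gpt-oss-20b.gguf"                    # placeholder path
TOOLBOXES = ["rocm", "vulkan-radv", "vulkan-amdvlk"]  # assumed names

for box in TOOLBOXES:
    print(f"--- {box} ---")
    # -p 512 measures prompt processing (pp512); -n 128 measures
    # token generation (tg128) -- the two numbers quoted in the video.
    subprocess.run(
        ["toolbox", "run", "-c", box,
         "llama-bench", "-m", MODEL, "-p", "512", "-n", "128"],
        check=True,
    )
```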
One weird quirk I found after I started testing the new open-source GPT models by OpenAI: those models are much smaller than this dense model, or even the Llama 3.3 70 billion parameter Instruct dense model, which loads fine on all of these. That's a chunky 40 GB model that wouldn't load on any of the Nvidia GPUs I have unless it's the RTX Pro 6000. I did a video on that one; it's got 96 GB of VRAM. But yeah, even this 11 GB GPT-OSS 20 billion parameter model loads fine, and it works pretty fast: 63 tokens per second for a short prompt like that. But for some weird reason, when I try to increase the context length to something a little more useful than 4,000 tokens, the whole model crashes. I found that this model with the extra-long repo analysis prompt also doesn't work on the GMKtec EVO-X2, even though both have 128 GB of VRAM available. So it crashes there. And this, by the way, is a mixture-of-experts model, so it doesn't use all of its parameters at once; only a few experts are active per token, which means far less memory traffic and much faster inference. So why wouldn't it load? That's a mystery. However, the 17,000 token prompt did load on the GMKtec EVO-X2, but did not load on the Framework Desktop.

You might notice something else from this chart: the EVO-X2 is beating the Framework Desktop in pretty much every single instance of every single LLM run. Here is the 120 billion parameter model, GPT-OSS 120B. Same thing: the EVO-X2 beats out the Framework in pretty much every case. Now, I had to test these out because they're so new. What about the ones we've been waiting for, the Qwen 3 Coder? I'm going to pick the fastest result here for the Framework, which was the 96 GB fixed configuration. Like I said before, which memory configuration works best depends on what model you're running. And here we are, going up against the M3 Ultra by Apple, the M4 Max, and the M4 Pro. The M4 Max is really doing quite well, better than the M3 Ultra, which was surprising. But also keep in mind that those are MLX models running on Apple, which are optimized for Apple Silicon. These are just running the standard GGUF models, and I'd say they're doing pretty well. Here is the short programming prompt: 65 tokens per second. By the way, these prompts are all on my GitHub, so if you want to check out the actual prompts, I'll link to them down below. Here's the long architecture prompt: 52 tokens per second. Wow.

Let's take a look at the GMKtec EVO-X2 here as well, and you'll see that they're actually really, really close: 45 and 45. Where they stray a little bit is on the longer prompts. I'd say on the long to medium prompts, the GMKtec box actually wins a little bit. And these are not just one run; these are multiple runs, multiple iterations, to get a statistical picture, until you get to the really long prompts, where the Framework Desktop actually beats out the GMKtec EVO-X2. And that might have something to do with Framework's cooling. It keeps it whisper quiet, on par with a Mac Studio, like I mentioned earlier. I'm just really impressed with it. That's why I keep talking about it. I've built a bunch of LLM-geared PCs that I've started reviewing on the channel here, and this has been the easiest machine that I've ever built, and also the quietest. [Music] [Applause]

So, as I mentioned before, Vulkan is your best bet right now on these chips. Here's why. I'm going to use Donato's toolbox. You can just type "toolbox list" and it'll show you what I have installed on there. Here, I'm going to use ROCm.
Over here on this one, I'm using Vulkan. And over here, I'm also using Vulkan, but there are two different versions of Vulkan. I didn't know that until I saw his video, which I'll link down below, by the way. Nice channel and a good video. His toolbox already has llama.cpp installed, so you can run inference against models, and it's using the GPU on all of these. So here it is running with ROCm, and I'm running llama-bench so I can get the token generation speed and the prompt processing speed for this particular model on this chip. And here we go: GPT-OSS, the 20 billion parameter version. We've got 566 tokens per second for PP512, which means prompt processing, and TG128, token generation, is 62. Not bad, right?

Let's go over to Vulkan over here, running the same thing again. And I'm going to run the same thing on this one, too. Interesting difference, huh? For this one, we've got token generation of 70, much higher than ROCm. And this Vulkan also gives us 71, which is pretty close to 70, so they're about the same. But look at the prompt processing speed: here we've got 575, and over here we've got 1,163. Much faster. What's the difference between those two Vulkans? This is where I would insert a Star Trek joke, but I don't have anything right now. Put one down below if you think of something. Well, this one is RADV, which is the community-developed Vulkan driver, and this one is AMD's open-source driver, developed by AMD themselves. Now, this is just an example for this particular model and the differences it shows.

New version of LM Studio; there's always a new version whenever I open it. I'm going to go to LM Studio and say "write a paragraph" over here, and I'm getting 63 tokens per second, which is slower than both Vulkans over here. It's closer to what ROCm is giving us for this particular model. Yet I do have Vulkan selected as my runtime. It doesn't say which version of Vulkan it's using, and I was going to try and guess, but now I can't, because the number is much lower than what I'm getting with llama.cpp directly.

Now, let's try the most dense model that I have, which is Llama 3.3 70 billion. I just want to come back to this for a second. These are now running, and they're running on the GPU. If we take a look at the GPU, you can see what's going on right there. That's pretty crazy. Oh, I've just started hearing the fan. But that's just now, after all this, and it's still not ridiculously loud. So props to Framework for getting a machine like this done. Oh, this one's taking a while. You can imagine that this is pretty much what it's doing right now on those other three machines. Here, I've got the Llama 3.3 70 billion parameter model pulled up in LM Studio, and it is pretty slow, going at 4.94 tokens per second. Let's see what we get with llama-bench directly. We got some results for Vulkan: 72 tokens per second prompt processing and five tokens per second token generation here; over here, also five for token generation, with 46 for prompt processing. But the ROCm run is not going anywhere. In fact, this is something that Donato mentioned in his video: for some of these models, it can take forever to load under ROCm unless you use a special flag called no-mmap. Apparently llama-cli has that flag, but llama-bench doesn't. So we're kind of stuck. That's where ROCm is right now, folks. Hopefully it keeps getting worked on and improved, but it's nice to have a couple of different options.
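(The gap between the mixture-of-experts results above and this dense 70B model comes down to memory traffic: token generation on these machines is roughly bounded by memory bandwidth divided by the bytes of weights each token has to read. Here's a hedged back-of-envelope sketch; the active-parameter counts are OpenAI's published figures for gpt-oss, while the bits-per-weight values and the 256 GB/s bandwidth are approximations.)

```python
# Back-of-envelope: token generation is roughly memory-bandwidth bound,
# so the ceiling is bandwidth / bytes-of-weights-read-per-token.
# All figures approximate; gpt-oss active counts from OpenAI's release.
BANDWIDTH_GBS = 256  # Ryzen AI Max+ 395 theoretical memory bandwidth

def ceiling_tok_per_s(active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound if every token streams all active weights from RAM."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return BANDWIDTH_GBS * 1e9 / bytes_per_token

# gpt-oss-20b: ~21B total, but only ~3.6B active per token (~4-bit MXFP4)
print(f"gpt-oss-20b:   ~{ceiling_tok_per_s(3.6, 4):.0f} tok/s ceiling")
# Llama 3.3 70B dense at ~Q4: every token reads all ~40 GB of weights
print(f"llama-3.3-70b: ~{ceiling_tok_per_s(70, 4.5):.1f} tok/s ceiling")
```

The measured numbers in the video (roughly 62 to 70 tok/s for gpt-oss-20b, about 5 tok/s for the dense 70B) land below those ceilings, as expected, since attention caches and runtime overhead add traffic, but the huge ratio between the two matches what the video observes.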
Now, I want to get back to the question of the board itself. This is Framework, after all. Those are the people that, you know, give you laptops where you can change the board out, the cards, the ports, the RAM. And you can't change the RAM on this. You can do pretty much everything else: you can put your own fan on, you can use your own power supply, but the RAM is soldered onto the board. It's part of that whole APU package for the Ryzen chip that's inside. Isn't that against what Framework stands for? Here's Nirav, the CEO of Framework, on Linus Tech Tips just a few months ago. >> So we did actually ask AMD about this. The first time they told us about Strix was actually literally our first question: how do we get modular memory? We are Framework, after all. >> (The other guy's Linus, by the way.) >> And they didn't say no, actually. They did assign one of their technical architects to really, really go deep on this. They ran simulations, they ran studies, and they just determined it's not possible with Strix Halo to do LPCAMM. The signal integrity doesn't work out because of how that memory is fanning out over the 256-bit bus. >> Oh well, that's bad news.

Now, Linus has to say that because of his core audience, of course, but he knows his investment in Framework will pay off big time. Apple has had unbelievable success with soldered RAM, and the numbers, both the company's bottom line and the benchmarks, don't lie. Yes, this is good for business because it's going to give them better products, but it's also good for us. Some may argue with that, and feel free to do so in the comments below, but performance is king in my world, and I would rather have a machine that performs faster on these kinds of AI and developer-related workloads than have the ability to change the RAM out. Amazing transparency on Framework's part, though. So yeah, technically it's a trade-off, but I think it's the right one.

The rest of the system still keeps Framework's ethos intact. I didn't even get the full desktop; I just got the boards. And that means I could pick my own case and my own fan and build it my way. There are two SSD slots, so you can use whatever SSDs you want. By the way, the SSD that Framework sent me along with all this is the Western Digital variety, and the speed is incredible. It lives up to its label. In my opinion, the board alone is actually a good buy, because I don't need the modular connections that come with the case on the full desktop. The board gives me everything I need: two DisplayPorts and one HDMI, so a total of three direct monitor connections, plus two USB4 connections right on the back along with two USB-A connections. The board also has expansion headers so you can plug in more USB ports if you need them in your own case. I demonstrated how easy it was to put this thing together, and even swap out the SSDs, in a recent live stream. Check that out if you missed it. If you don't need their modular desktop chassis, you can save a lot with a budget case and still keep things whisper quiet.

But I haven't set up the cluster yet, after being a little discouraged when I saw Jeff Geerling's video where he got only 0.7 tokens per second, which is completely unusable. But I think there's still opportunity here. You saw me running the 235 billion parameter model, which takes up 111 GB while it's running, on one of these.
Now, you could use two of them and run a slightly bigger model, but the more nodes you add to a cluster, the slower it's going to get, because you're bottlenecked by the network. I have a 10Gb switch here, so these are getting nice fast Ethernet, but even after connecting these together using USB4, it's limited by the network manager in Linux, which caps it at 10Gb as well. So you're not getting the full speed you could through Thunderbolt 4 here, and therefore you're not going to be able to run the huge DeepSeek or even Qwen 3 Coder 480 billion parameter models just by throwing more nodes at it. Bigger models, yes; faster, definitely no. I'd stick to running one of these at a time, as it's a pretty good machine on its own. But definitely check out Jeff's video, as he goes into more detail on his cluster setup. And if you want to see my Mac Mini cluster setup, watch that video over here, or my Mac Studio cluster setup, watch this video over here. Thanks for watching, and I'll see you in the next one. [Music]
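(To put that network bottleneck in perspective, here's a quick back-of-envelope comparison, using the 10Gb link figure from the video and the chip's approximate 256 GB/s memory bandwidth.)

```python
# Why clustering slows things down: the interconnect is orders of
# magnitude slower than local memory. (10 Gb/s link figure from the video.)
local_bandwidth_gbs = 256       # on-package RAM bandwidth, GB/s
network_gbps = 10               # 10GbE / capped USB4 link, Gb/s
network_gbs = network_gbps / 8  # = 1.25 GB/s
print(f"Local RAM is ~{local_bandwidth_gbs / network_gbs:.0f}x faster "
      f"than the link between nodes")  # ~205x
```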
Video description
This is how AMD's Ryzen AI Max+ 395 should be done - a whisper-quiet 128GB powerhouse that’s built for local AI, with the Framework Desktop boards. Check out ChatLLM: https://chatllm.abacus.ai/ltf 🛒 Gear Links 🛒 📦🖥️ Mini Rack: https://amzn.to/4dXfwan 📦📏 Taller Mini Rack: https://amzn.to/4lKdCwb 🌐⚡ 10Gb switch: https://amzn.to/4mxHxsL 🖥️🎛️ Rackmount monitor: https://amzn.to/4mAByDB 💽🔗 Great 40Gbps T4 enclosure: https://amzn.to/3JNwBGW 🔧💾 My nvme ssd: https://amzn.to/3YLEySo 🛠️🛒 My gear: https://www.amazon.com/shop/alexziskind 🎥 Related Videos 🎥 🖥️📡 Mac Studio cluster - https://youtu.be/d8yS-2OyJhw 🖥️📡 Mac Mini cluster - https://youtu.be/GBR6pHZ68Ho 💻🔥 This Laptop Runs LLMs Better Than Most Desktops - https://youtu.be/AcTmeGpzhBk 💻🔍 Full Z13 review - https://youtu.be/fGEqxHurxZM 🔋⏳ How long can they last? | ULTIMATE BATTERY TEST - https://youtu.be/u1XJAOf_W5w 💻🛠️ Set up laptop for Software Development - https://youtu.be/3mCZ3WUcM8s 💻✨ 15" MacBook Air | developer's dream - https://youtu.be/A1IOZUCTOkM 🤖⚙️ INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k * 🛠️ Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX Donato’s video: https://youtu.be/wCBLMXgk3No?si=Xo_rEWoDCQ7oVyAP My prompts: https://github.com/alexziskind1/machine_tests/tree/main/ml/auto_prompter/prompts — — — — — — — — — ❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺 Click here to subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1 Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join — — — — — — — — — 📱LET'S CONNECT ON SOCIAL MEDIA ALEX ON TWITTER: https://twitter.com/digitalix — — — — — — — — — Sorry, Jeff, mine is bigger. #gmktec #framework #llm