Analysis Summary
Performed authenticity
The deliberate construction of "realness": confessional tone, casual filming, strategic vulnerability, all designed to lower your guard. When someone appears unpolished and honest, you evaluate their claims less critically. The spontaneity is rehearsed.
Goffman's dramaturgy (1959); Audrezet et al. (2020) on performed authenticity
Worth Noting
Positive elements
- This video provides a rare look at how the massive 397B Qwen model performs on consumer-grade (though high-end) Apple Silicon hardware.
Be Aware
Cautionary elements
- The 'benchmarking' is highly integrated with the promotion of the creator's own software, making it difficult to separate model performance from software optimization.
About this analysis
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking; what you do with it is up to you.
Related content covering similar topics.
MacBook Pro M5 Max vs M4 Max Benchmark Results Are INSANE!
Matt Talks Tech
This Shouldn't Be Able to Run 120B Locally
Alex Ziskind
MacBook Pro M5 Max: What Apple Is Really Building
Bobby Tonelli
Transcript
Hey guys, welcome to the show. Today we're checking out the whole family of the Qwen 3.5 models that have just been released. We've got the large one, the large one in charge: 397 billion parameters, 17 billion active parameters. It is a big one. They're competing against the ChatGPTs and the Claudes, and they are topping the charts. But interestingly enough, they've also just released smaller ones: a 122 billion parameter one, a 35 billion one, and even a 27 billion one which is actually smarter than both of them. Crazy, if you can believe it. So just looking at these charts over here: we've got the 3.5 hundred-billion one, the 27 billion, and the 35 billion, and you can see that they are leading the charts. They're even beating ChatGPT 5 Mini, and the one that's beating it is actually the 27 billion parameter one. So if you don't have all of the memory in the world, that is the smartest one to get. The 100 billion one is just below it on IFBench instruction following: if you tell it to do something, will it follow your commands? That's very, very important. But when it comes to software engineering, you can see they are all chart-topping. Again, the 27 billion one (I still need to upload the Q9 version of that one) is topping the charts over here. Very, very smart.

Something also interesting: if we go into Artificial Analysis, which is kind of like a benchmarking website (I need to get into this benchmarking game to actually show the comparisons properly rather than doing this stuff live), you can see that GLM is leading the pack of all the open source models, then you've got Kimi K2.5, and then boom, you've got Qwen 3.5, and then you don't see any more Qwens. However, if you actually look at the scores, you might see something slightly different. You get 45 with Qwen 3.5, and the next one still in the top 10 is MiMo with 41. Again, no other mention of Qwen 3.5. However, if you go into medium-size models, we can see that the Qwen 3.5 hundred-billion one, the 122 billion one, has actually also got 41. And if you go to the small models, you can see that the 27 billion one has actually got 42. So if this all-open-source-models list included every size, you'd see all of these models in the top 10. That's just how amazing this release is.

And to top that, I've actually been uploading all of the different permutations of the models to fit different memory sizes. In the latest version of Inferencer, if you go into models over here and do a search, for example if I type in "inferencer qwen 3.5", turn the filter off and hit search, it will show you all of the models that are there. So I've got a 445 gigabyte model; that's the biggest one, at 9-bit quant. However, new feature: look at that. You press this filter button over here and it straight away removes all the models which are too large for my computer. So the largest one I can actually run is 99 GB. It's crazy. At least, that's 99 gigabytes, because gigabytes are base 1000 whereas gibibytes (GiB) are base 1024, and the true number is in GiB. GB has been taken over by the storage consortium. Look it up if you don't believe it.
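(A quick sketch of that GB-versus-GiB arithmetic, in Python. Nothing here is from the video beyond the two sizes it mentions; the conversion itself is just the base-1000 versus base-1024 definition.)

```python
# GB (base-1000, used by storage vendors) vs GiB (base-1024, used for RAM).
def gb_to_gib(gb: float) -> float:
    return gb * 1000**3 / 1024**3

print(f"{gb_to_gib(99):.1f} GiB")   # 99 GB  is about 92.2 GiB
print(f"{gb_to_gib(445):.1f} GiB")  # 445 GB is about 414.4 GiB
```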
Now, the cool thing is that when you press that filter on, you can filter by MLX, because Inferencer can run both MLX and standard safetensors. So if you just want the unquantized version, you can actually run that if you've got enough memory, or if you use model streaming, which just streams from storage. And you can also filter by memory, so it gets rid of all the models which would be too large for your computer. Boom, we've got that new feature there, so you can go check which particular model will run on your system.

But in this one, we're actually going to be comparing the full fat, super large one against the smaller ones, and seeing how much of a loss we get, the speed, and all that kind of crazy stuff. These are some examples we're going to be running with all of the different versions, and I'm going to try to do it as fast as possible so this video doesn't get too long. But look at that: procedural world generation. We can change the time of day, generate a new world, change the atmosphere density, change the sea level and the mountain height. This is just beautiful; you can make Mars. It's a gorgeous, beautiful visualization. That is the Q9 quant.

Now I'm going to jump to the next one down the list, which is the 4.1-bit quant of the large one. Anything below 4.5 bits tends to break coding, because code is very, very structured, and when you start making the quantization really heavy, it starts making mistakes. I'm getting into this world and I'm going to come up with better quants and better solutions for this, but this is the state of play as it is. If you did want to run the large one and have it fit inside 256 GB of RAM, this is kind of the solution I've gone for: you can check out the 4.1-bit. If you're looking for coding, you're not going to get that good of a result, I'm sorry, but this quant does produce better quality than the MXFP4 version. It's writing ahead, and it looks like it's writing code; it looks like it's going to do something good. We'll see what the end result is.

While it's doing that, what you should be using this model for is more like creative writing or even tool calls. So I'm going to do a tool call over here: go on Wikipedia and tell me all about flat earth. Boom. It's gone on Wikipedia, downloaded the flat earth page, and it's going to summarize it. And we're looking at 26 tokens a second, and that's not just one inference; that's two inferences going at the same time. It's writing out the code as well as telling me all about flat earth. And something I'm going to check here: it's got some facts, about 330 BC, and it's got a stat there, 82% of Americans. So I'm going to search for 82%, and we can see, boom, it says 82% of 18-to-24-year-old Americans agreed with the statement, and over here we can see 80% said the world is round. So it summarized that correctly. So while coding isn't going to be good on this 4.1-bit quant (you probably just want to go for the 100 billion one), if you are doing text-based stuff, you can still get some reasonably good intelligence out of it, and that is just to fit within that 256 memory budget. We can see that it produced 550 tokens here at 24.8 tokens a second.
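(For anyone who wants to reproduce that Wikipedia tool call, here is a hypothetical Python sketch. It assumes a local OpenAI-compatible endpoint at localhost:8080 and a placeholder model id; the video doesn't confirm Inferencer's actual host, port, or model names, and it uses Wikipedia's REST summary endpoint as a stand-in for a full page fetch.)

```python
# Hypothetical: fetch the Flat Earth article and have a locally served model
# summarize it. Endpoint URL and model id are assumptions, not confirmed.
import requests
from openai import OpenAI

page = requests.get(
    "https://en.wikipedia.org/api/rest_v1/page/summary/Flat_Earth",
    headers={"User-Agent": "local-llm-demo/0.1"},
).json()

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="qwen3.5-397b-q4.1",  # placeholder model id
    messages=[{
        "role": "user",
        "content": f"Summarize this, keeping any statistics:\n\n{page['extract']}",
    }],
)
print(reply.choices[0].message.content)
```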
That's two inferences going at the same time, and we used 191 GiB, which leaves us 50 GB to play with for the context window. By default on Macs, depending on the macOS version, the system can keep back 25% of memory just as a safety buffer, and if you want to use more than that, you have to start using some sudo commands and all that kind of crazy stuff.

Let's see how this code was produced. As you can see, it's writing a healthy amount of code; it's spewing out something like 4,000 tokens. One of the little things here: it produced the code but then ended up emitting a think tag and summarizing what happened, so it got a bit confused about what was going on in the world. I'm just going to grab this code and show you it running right now. So, Q4.1: it's got the left side of the screen set up the same as Q9, however it does have a syntax error. It says "illegal return statement", so it did fail on that one. Now, I did produce lots of other demonstrations for this already, and at Q2 I did manage to get TiddlyWiki going. You can edit it (it doesn't look so good when you hit the edit button), but you can write in it, hit save, and it does persist on the screen. So even though it's broken at coding, you can still get some sort of result out of it. That's something to be aware of.

So I'm going to keep this prompt; it's going to be my key prompt, since it's a challenging one. And I'm going to go to the next one down, which is the Q9 version of the 122 billion parameter one. This one is kind of similar to maybe Solar Open, if you've heard of that, or MiniMax, because it has 10 billion active parameters. MiniMax, though, has double the total parameters, so it should have deeper intelligence but the same sort of speed, because it's 10 billion active parameters. And we see here we're going 40 tokens a second on a single inference while it's writing other stuff. At the same time, I'm going to hit it with thinking enabled and the car wash question, just to see if there's something there or if it's just spewing out training data. So: the car wash is over 50 m from my house; I want to get the car washed; should I walk or should I drive? It's thinking away. We're going 24 tokens a second each, so 48 tokens a second, two inferences happening at the same time. Very, very good stuff if you want to be running this locally; it's kind of like a server.

Memory-wise, while those two inferences are happening, I'm using 171 gigabytes in total, because I'm probably doing some background stuff, but the model itself has 130 gigabytes. So it's just over the 128 GB limit. Well, with 128 GB you really don't want to be using anything more than 92, because you want to have that 25% buffer, the context window, all that kind of stuff, and macOS doesn't actually give you the full amount, like I said before. So if you've got a 256 system, or 192, or anything like that, you're going to be loving this model, or you can even do a bit of model streaming, because it's a pretty fast model as it is. But we can see it used 130, and it was smart enough to know that you should drive the car. So that is a win on logic. Let's see how it handles code. Now, to make this interesting, I'm also going to be throwing in some agent work in the background.
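(A minimal sketch of that memory-budget arithmetic, assuming the roughly 25% system reservation the video describes; the exact cap varies by macOS version and, per the video, can be raised with sudo commands.)

```python
# Rule-of-thumb budget from the video: macOS keeps back ~25% of unified
# memory by default, and the model plus context must fit in the rest.
def usable_gb(total_ram_gb: float, reserved_fraction: float = 0.25) -> float:
    return total_ram_gb * (1.0 - reserved_fraction)

for ram in (32, 64, 128, 192, 256, 512):
    print(f"{ram:>3} GB system -> ~{usable_gb(ram):.0f} GB for model + context")
# e.g. 128 GB -> ~96 GB, in line with the "don't go much past 92" advice
```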
So I'm going to load up the 122 billion Q9, hit save, go back, and say "what files do I have?". It's making the API request over here to my server and it's going to start prompt processing. The very first time it makes a request, it's going to be a bit slow, just to cache it; but if you have prompt caching enabled, the next time you make a subsequent request, everything's going to be super fast. I'll just show you that: I'll hit cancel, hit plus, and say "what files do I have?", and straight away it comes up with a response. See? Boom, it started writing out the response, and we're still inferencing away on our other task.

I can also hook up OpenClaw. So I'll get the config over here and copy-paste my model. You've got to change it in three locations, so I paste it in there, there, and there, hit save, go back, and do a new session. Again, the first time round it's going to prompt-process, because it sends a massive 60,000 character system prompt, but the next time round, when I hit new session, it just starts answering straight away. So we've got it all cached.

And finally, let's see how it did with the Earth prompt. Look at that: it made the exact same visualization as the massive one. That is gorgeous. It's the same Earth-like, Mars-like look. This one here is very, very similar; you've even got a moon there. So that was a fantastic generation from the 122 billion one. If you look at the numbers for this one, it produced 8,800 tokens at 32 tokens a second, and I was running other inferences in the background, so it's even faster than that. And memory-wise it used up 131 GiB. That's fantastic.

What I'll do as well is switch over now to the Q6.5 version and run that test one more time, because the Q6.5 is going to fit in systems that have 128 GB of RAM. Now, you might not have a 128 GB RAM system. Don't worry, we're going to go smaller. And the 27 billion parameter one is slower, but it should be more intelligent, apparently, according to the benchmarks they've already done. So this is producing code at over 40 tokens a second, 44 tokens a second just there; it's spitting out the code. I want to see how much of a quality loss we're going to get. Is it going to be able to make this generation just as good as the Q9 one? Are the 128 GB RAM people going to be happy? Usually 6.5-bit quants are very, very good, so let's just see how it goes. But this is a very, very advanced prompt, if you can't tell. At the same time I'm going to jump into Kilo Code again and just show you, and I'm also going to ask it the comprehension question it figured out at Q9. So I'll hit run again, and it's processing two prompts at the same time. We've got 29 tokens a second on one and 29 or 30 tokens a second on the other, so 60 tokens a second with two inferences going at the same time. I'll also hit it with some Kilo Code. Again, the first time around it's going to do that prompt-processing step, but if you have the cache enabled, it saves it to disk and you're going to be happy the next time around. As you can see, everything's now inferencing away all together. I'll hit cancel, start a new task, ask "what files do I have?", and straight away it's answering the question.
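(To make the prompt-caching behaviour concrete, here is a hypothetical timing sketch: the first request pays the full prompt-processing cost, and a second request sharing the same prefix should return much faster if the server caches it. The endpoint, model id, and prompt size are all placeholders, not confirmed details of Inferencer.)

```python
# Hypothetical: time two identical-prefix requests against a local server.
# With prompt caching on, the second ("warm") request should be much faster.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
big_system_prompt = "You are a careful coding agent. " * 2000  # ~60k chars

for label in ("cold", "warm"):
    start = time.time()
    client.chat.completions.create(
        model="qwen3.5-122b-q9",  # placeholder model id
        messages=[
            {"role": "system", "content": big_system_prompt},
            {"role": "user", "content": "What files do I have?"},
        ],
        max_tokens=64,
    )
    print(f"{label} request took {time.time() - start:.1f}s")
```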
It's outputting code, and it also said that you should drive. So we didn't lose that comprehension: you can't wash your car if you walk to the car wash; you have to actually bring the car with you. Some big models fail that one, and if you go down to 3-bit quants you're probably going to fail it too. So the 6.5-bit quant is still good. We're getting some agents happening in the background there; you can hook up a bit of OpenClaw if you want. And we're producing over 3,000 tokens here, and memory-wise we're at 96 GB. That is a comfortable amount of memory for a 128 GB system, so I'm happy with that situation. We ended up producing over 7,000 tokens at 37 tokens a second, and that's with the other inferences going at the same time, because we've got the feature called batching enabled. It's going to get a lot more solid and reliable in the next version, so stay tuned for that.

And our Q6.5: we can see that we're starting to lose a little bit of quality here. You can change the time of day, you can change the height, you can modify stuff, you can change the detail. But I'm going to check the runtime errors, and we can see that we actually do have some errors here. So what I'm going to do is copy this error and give it a second chance, because I did lobotomize it slightly with Q6.5. I'm also going to say that the planet is white. It says it knows why the error is there: because distance() in GLSL requires both arguments to be the same dimension, vec3 and vec3, and in the code we have a vec3, but u_impact.zw is a vec2. So it's going to output the 8,000 tokens again. I should have just told it to tell me the line of code to change. Matter of fact, I'm going to do that: I'm going to clone this conversation and say "tell me which line of code to change". It says grab this bit of code and replace it with this bit of code. That makes it a lot easier. And look, the runtime error has gone, but the screen is now black, so I could probably say "no more error, but the planet is black". With every single change we're getting a little bit closer. But as you can see, if you just get the Q9 version, you're going to get the ultimate quality, especially for advanced prompts. For logic, comprehension, getting Wikipedia articles, and creative writing, you're going to get good results with Q6.5; but if you want coding, then it starts to degrade, as you can see.

Let's jump to the next one down the list, which is the 35 billion parameter, 3 billion active parameter one at Q9. I'm going to hit it with the advanced prompt as well as the logical conundrum. And look at that speed: we're going over 50 tokens a second each with two inferences going at the same time, so probably 110 tokens a second combined. I'll probably add in a bit of Kilo Code as well; I'll change the settings over here and point it at that one. So the 35 is just processing away, and now it's done. The next time I run "what files do I have?", it's going to be cached and give me the response straight away. Boom, just like that. Now, interestingly enough, this one is doing a lot more reasoning. We've got over 2,500 tokens, and it's not in a loop just yet, but it's coming up with every single permutation of what could happen at the car wash. And it might be in a little loop. So what I might do here is change the temperature from zero, which is good for coding.
I'm going to set it to one, just to help it break out of that loop a little bit. But when it comes to batching, you need to make sure that your temperature, and your inference parameters in general, are the same so the requests can be combined together. I'll work on supporting different temperatures, because it's definitely possible, but that's the situation as it is today. So I'll have that lined up next. And just to speed things up, I'll keep the temperature at zero but ask it with thinking disabled. And with thinking disabled, look at that: it says you should drive your car to the car wash. It might have already been trained on this one; it's a very, very new model and this is a very, very popular prompt, so maybe it's been trained on it. But it's good, because some models, ChatGPT included, said "you should walk" when I asked them, and they're supposed to be the smart ones.

So, we produced 9,000 tokens for the Earth prompt. I'm going to copy that in there, and I'll just check out the memory usage so you're aware: it's using 38 gigs of memory. So if you've got a system with 64 GB of RAM, you can run this model at Q9, the highest quality you can get. And it does have that same sort of runtime error, so yeah, it's not as smart. I'd suggest maybe checking out Qwen 3 Coder next; that's using a very similar architecture to 3.5 (3.5 adds in image recognition, which I haven't implemented yet). There's also LongCat Flash Lite; that one we checked out in the last review, a very, very intelligent model. But yeah, we're starting to get some errors when you go down to that low quant.

There is a solution for this, two solutions actually. I've got a Q5.5-bit version of this model if you want to play with it. Again, you're going to start getting a few more errors, but you can still ask it questions. For example, it said you should drive the car when I set the temperature to one. Let's just load that Q5.5 model up quickly and see how much you lose. I'm going to ask it again: should you drive or walk from the house? It said you should drive, so it still has a bit of intelligence in there. And the memory usage of this one was 22 gigs, so if you've got a 32 GB RAM system, you'll be able to run this model and get some smart answers.

But the next step down, which is actually a step up, is the 27 billion parameter version. I've got the 27 billion at Q9, so I'm going to run that with the same massive prompt. You can see we're running two inferences at the same time at 10 tokens a second. And according to the statistics, the 27 billion version actually gets 42 in intelligence, which is more than the 122 billion parameter version, which gets only 41. So let's see what it does. Right now we're doing two inferences at the same time and our memory utilization is 30 gigs, so you're going to need slightly more than 32 GB of RAM to run it. I do have a Q7 quant of this one for people who only have up to 32 GB of RAM; you're still going to get good intelligence, but it can start making those little errors. So, we'll finish right there. And it says you should drive the car there; it comprehends, very, very smart. And it also produced 9,000 tokens for this prompt. OK, it's a more detailed prompt than I usually do, but very, very intelligent. What will it actually look like?
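(A sketch of the batching constraint mentioned at the top of this segment: requests can only be fused into one decode batch if they use identical sampling settings, so a naive scheduler groups pending requests by those settings first. This is illustrative only, not Inferencer's actual implementation.)

```python
# Illustrative only: group pending requests by sampling parameters, since
# requests with different temperatures can't share a decode batch.
from collections import defaultdict

pending = [
    {"prompt": "Generate the Earth demo", "temperature": 0.0},
    {"prompt": "Car wash: walk or drive?", "temperature": 0.0},
    {"prompt": "Write a short story", "temperature": 1.0},
]

batches = defaultdict(list)
for request in pending:
    batches[request["temperature"]].append(request["prompt"])

for temperature, prompts in sorted(batches.items()):
    print(f"batch (temperature={temperature}): {prompts}")
```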
But before I do that, I'm actually going to run this prompt one more time with the Q7-bit version, the one that would fit into 32 GB RAM systems. So for comparison: this was the 35 billion parameter one. This was the 100 billion parameter one at Q6.5. And this was the 100 billion one at Q9; that looks very, very good. And this is the Q9 of the 400-plus billion parameter one. And the 27 billion one? Look at that. Look at that quality. You've got the time of day, you've got zoom, surface detail, atmospheric density. This is a really good version. You've got Mars over here, the Mars planet. What you don't have, which the bigger ones had, is mountains, mountain height, and sea level; you're missing those parameters. But I'm sure you could prompt it into them, because it's just more like a seed factor: it chose to do different things because of the seed, the day it is today, all these different factors. But you can see it's definitely a great demonstration, and it's even got an alien planet. That is beautiful. 27 billion parameters on a 64 GB RAM system; you can run this super intelligence, and that is good.

Now, the question is: does it beat the 100 billion parameter one? I'm not sure. I don't think I can say that yet. I know Artificial Analysis say it beats it by one point, but I'm not sure; potentially we'll do some more tests and find out. But the 100 billion parameter one runs three times faster than the 27 billion parameter one, because the 100 billion one has 10 billion active parameters, whereas this one uses all 27, so it's three times slower. And we can see our little friend here: we're producing two images at the same time, going 20 tokens a second each, so 40 tokens a second for both. Very, very fast. And memory-wise, 27.12 GB, and that's with two inferences; with only one inference it's going to be less, and that should hopefully fit into a 32 GB RAM system. If it doesn't, let me know; I can always do a specialized version for your particular case. And I am working on more intelligent ways to compress the models as well, so stay tuned for that stuff. Final decision: the answer is drive the car. So we've quantized it down to Q7, reduced our memory footprint, and we still got the correct answer. And we can see that with Q7 we're producing over 3,000-plus tokens and it's all working. You can also hook up Kilo Code like we showed, and you can hook up OpenClaw to them as well.

One thing to note, especially with the quantized versions: rather than just asking the coding question directly, you can ask it via a coding agent, if the coding agent is any good or worth its code. What a good agent should always do is prompt the model, get the response, then run some basic syntax checks, just like Xcode does out of the box, just basic compile checks to make sure it works; and if there are any compile errors, it should re-prompt the model and get to the actual answer in the end. That's the whole point of having an agent. So if there are little errors here and there in the output of a quantized version, running it through a code agent should yield a better response, if the agent is worth its code. If yours isn't, let me know and I can always sort you out with something better, and we can always improve things if you are running into any issues.
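(A minimal sketch of that agent repair loop, assuming an OpenAI-compatible local endpoint. It uses Python's own compile() as the stand-in syntax check; a real agent targeting GLSL or JavaScript would swap in the appropriate compiler. The endpoint and model id are placeholders.)

```python
# Sketch of the generate -> syntax-check -> re-prompt loop described above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def generate_until_it_compiles(task: str, attempts: int = 3) -> str:
    prompt = task
    for _ in range(attempts):
        code = client.chat.completions.create(
            model="qwen3.5-27b-q7",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        try:
            compile(code, "<generated>", "exec")  # basic syntax check only
            return code
        except SyntaxError as err:
            # feed the compile error back to the model, as the video suggests
            prompt = f"{task}\n\nYour previous code failed to compile:\n{err}\nFix it."
    raise RuntimeError("no compilable code after several attempts")
```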
But overall, what do you guys think of this crazy situation they call Qwen 3.5? We ran the massive 400-plus billion parameter version of the model, we ran the 100 billion parameter one, we ran the 27 billion parameter one (the top quality one), and we also ran the 35 billion parameter one. So we've got a whole family of models for you guys to play with. And if you are playing with them inside Inferencer, there's the filter button here and you can filter by memory, which means it will only show you the models that could run on your system. Now, if you have lots of applications open, just check your Activity Monitor: for example, me, I'm using OBS, and I'm using 100 gigabytes of RAM. So even though these models can run on my system, I have to close down all my applications to get them running. Just be aware of that if you have any background applications.

But let me know what you guys think. Qwen 3.5, I think it's amazing; they've done a great job, smashed it out. And if you don't know, they make some amazing image generation models as well: they make Qwen Image and they also make Z Image Turbo. So thank you for your amazing work. Anyway, let me know what you guys think; I hope you found this video useful and enjoyed the show. And in case you're wondering, the Q7 version made 7,000 tokens, and this is the version it outputted. The Earth looks a bit blobby, but it's running: there are no errors, we've got lighting, we've got atmospheric density. So we've got all these levels; there's lots of potential there at Q7. That display is good.
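(A last hedged sketch of that Activity Monitor check: you can estimate headroom programmatically before loading a model. psutil is a third-party package, installed with pip install psutil, and the model size is whatever quant you're eyeing.)

```python
# Will a model of a given size fit next to what's already running?
import psutil

def model_fits(model_size_gb: float) -> bool:
    available_gb = psutil.virtual_memory().available / 1024**3
    print(f"~{available_gb:.0f} GB free; model needs ~{model_size_gb:.0f} GB")
    return model_size_gb < available_gb

model_fits(99)  # e.g. the 99 GB quant mentioned earlier
```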
Video description
Qwen are back with their hero model Qwen-3.5, featuring chart-topping results and a great selection of sizes to choose from. So let's see how they compare when running locally.

TEST SYSTEM
Inferencer App: https://inferencer.com
Models: https://huggingface.co/inferencerlabs/models?search=qwen3.5
Parallels Virtual Machine: https://vtudio.com/a/?parallels
2025 M3 Ultra Mac Studio | 512GB RAM

SPONSORED by xCreate
Local AI Image Gen: http://xcreate.com

BUY NOW
Mac Studio: https://vtudio.com/a/?a=mac+studio
MacBook Pro: https://vtudio.com/a/?a=macbook+pro
LG C2 42" Monitor: https://vtudio.com/a/?a=lg+c2+42
NAS Drive: https://vtudio.com/a/?a=qnap+tvs-872xt

COMPANION VIDEOS
Qwen3.5 397B: https://youtu.be/tzF8jv3VGAg
Kimi K2.5 Local Cluster: https://youtu.be/JM41u7emnwo
Kimi K2.5 with OpenClaw: https://youtu.be/ExzAiMjT6jg
FLUX 2 Klein: https://youtu.be/K6DiPiDgDds

SPECIAL THANKS
Thanks for your support, and if you have any suggestions or would like to help us produce more videos, please get in touch. Links to products often include an affiliate tracking code which allows us to earn fees on purchases you make through them.