Analysis Summary
Ask yourself: “Did I notice what this video wanted from me, and did I decide freely to say yes?”
Performed authenticity
The deliberate construction of "realness" — confessional tone, casual filming, strategic vulnerability — designed to lower your guard. When someone appears unpolished and honest, you evaluate their claims less critically. The spontaneity is rehearsed.
Goffman's dramaturgy (1959); Audrezet et al. (2020) on performed authenticity
Worth Noting
Positive elements
- This video provides a highly practical explanation of speculative decoding and offers a genuine open-source tool (Draftbench) to help users optimize their specific hardware configurations.
Be Aware
Cautionary elements
- The use of a scripted 'skeptic' voice to rebrand established technical terms ('speculative decoding' to 'guess and check') is a subtle way to claim intellectual ownership over a standard industry technique.
Influence Dimensions
About this analysis
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Transcript
All right, you're going to like this. Watch this. Meta Llama 3.1 70B Instruct, 8-bit quant. I'm going to paste in this prompt. Write a Python class for a binary search tree with insert, delete, search, and in-order traversal methods. Boom. This is running on an M4 Max MacBook Pro, by the way. Pretty fast machine, but this is a very dense model and it's also 8-bit, so it's not going super fast. Let's fast forward after this whole thing finishes. I got 6.31 tokens per second. I would say that's pretty close to unusable range. But here I am running the exact same model on the exact same hardware with the exact same prompt, but it's going two times faster. Here it is. It's finished. And I got 12.26 tokens per second. How did I do this? It's called >> Whoa, whoa, whoa. You're not going to call it that, are you? People are going to click off the video if you call it that. >> Why? Most of the people watching already know what that is. Sounds like some kind of firmware update. >> Yeah, it's pretty terrible. >> Asymptotic complexity is also a term, but that doesn't mean you'd lead with it. >> What do you suggest we call it? >> Call it what it actually is. Guess and check. A small model drafts ahead and the big model checks it. >> Or how about draft and verify? >> No, no, no. That sounds like a tax form. Fine, guess and check. >> Yeah, much better. So, what we need is to find the perfect guess-and-check models that work together well. And a lot of the tools that you already use for generation support this: LM Studio, llama.cpp, vLLM. Here's just a taste in LM Studio. But we'll come back to llama.cpp in a moment. I can do that just by opening up a chat, selecting a model, and adding a draft model. But wait a minute, what kind of draft model can I add? Because there are a lot of different models out there. Which ones are going to be more suited to speed things up for you? Because some, as I'll show you in a bit, will not speed things up for you and will actually degrade your performance. Like this one right here, the speedup is negative 38%. Yeah, you don't want to do that. So, of course, I had to go write a program to help you find the sweet spot. AI helps me move fast, but it can't replace the fundamentals, so I keep investing in them. Meet boot.dev. Hands down the most engaging way I found to learn back-end web development. It's basically an RPG for coding. Quests, XP, leaderboards, and built-in AI guidance so you can keep moving. Projects and courses in Python, Go, and JavaScript align to real back-end tasks: APIs, auth, databases, testing, caching, everything that matters for building real services. There's a helpful Discord, clear answers when you're stuck, and a curriculum built for real jobs. The best part, you can browse lessons for free. Yeah, free. A membership unlocks interactive coding, AI hints, progress tracking, and a game layer. I used to binge tutorials like a Netflix show. Great entertainment, zero muscle. Boot.dev turns that time into practice that ships real services. Hit the link below and use my code for 25% off your first payment. I'll see you in the quests. I decided to start out with a model that's not the newest model, but Qwen 2.5 has probably the most different variants out there. Look at this. I'm on Qwen's Hugging Face, and we've got four pages of Qwen 2.5 models. There's coder varieties. There's 1 million context window varieties. I just wanted to go simple. We've got Qwen 2.5 0.5 billion. We've got 7 billion. There's also 3 billion. There's 14 billion, 32 billion parameters, and 72 billion parameters.
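A quick note on the arithmetic behind the opening demo: the "two times faster" claim is just the ratio of the two measured generation rates. A minimal sketch in Python, using the figures quoted above:

```python
# Speedup from the opening demo: same model, same hardware, same prompt,
# with and without a draft model (figures as quoted in the video).
baseline_tps = 6.31      # Meta Llama 3.1 70B Instruct, 8-bit, no draft model
speculative_tps = 12.26  # same setup with speculative decoding enabled

print(f"{speculative_tps / baseline_tps:.2f}x faster")  # -> about 1.94x, i.e. roughly 2x
```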
There's a lot to pick from. So I thought this would be a good model to get a lot of different combinations and see how they interact with each other. So for example, if you use, uh, the 72 billion parameter model as the target model, that's if you want to run that 72 billion parameter model. How do you make it a little bit faster? Well, if your target model is 72 billion parameters, then your draft model will be smaller. It'll be like a 7 billion or a 3 billion or a 1.5 billion or a 0.5 billion. But how do you know which one to use? That's what we're trying to answer, because it's not something that's documented anywhere. There's also different quantizations of each one. So 72 billion parameters comes in FP16, Q4_0, Q4_K_M, Q8, Q6. So, I got a couple of those. I got the 8-bit version and I got the 4-bit version of the 72 billion parameter model. And then, for example, a 7 billion parameter model. You'd think, "Oh, that might be a good one as a draft model." By the way, I'll explain all this in a moment. Just hang on with me for a second. You'd think, "Oh, 7 billion is, uh, pretty good, right?" But then you go in here and you say, "Which 7 billion should I run? There's so many. Here's one by Bartowski." But it's not just one. There's two-bit, there's three-bit, there's four-bit, there's five, six, eight, even 16. Well, as it turns out, the 7 billion parameter model is not a good draft model for the 72 billion. The 72 billion Q8_0 quant gives us 8.7 tokens per second on this M3 Ultra Mac Studio that I got over here. 8.7, hardly usable, but look, we can get up to 27.6 tokens per second from the same model by using the best 1.5 billion draft model, and the 0.5 billion draft model is pretty good too: 25.2 tokens per second. Then the best 3 billion draft model gives us 27. And finally there's the 7 billion draft model at 26.2. So 7 billion, not the best, but still not bad. All these smaller models running as draft models using speculative decoding. There, I said it. Okay. They make this model actually usable. They put this model in the usable range with the tokens per second. Give me a moment and I'll go into these details as well. I know that a lot of you already know this, but some of you might not. What is this dirty word, speculative decoding, or, as we're going to call it now, guess and check? Well, if we go back to LM Studio, it has a really nice way of visualizing this. Let's start from scratch. I'm going to create a new chat. I already have the model loaded. This is the 70 billion Meta Llama 3.1, 8-bit. If I open up this little sidebar here and go to tweak some things up here, and then go down to the bottom where it says speculative decoding, it's either on or off. It's most likely going to be off to begin with, but you can select a draft model. Smaller models run much faster no matter what hardware you're using. And because it runs faster, it's going to be able to guess at the next token that you need much faster than the big model. So, it's going to take a guess. And look, all these small models, these draft models, are going to be compatible with the main target model. LM Studio knows that. The reason they're compatible is because they share the same tokenization language. You see, along with all these files that you usually see when you download a model, let's go to Qwen for example. Qwen 2.5 14 billion instruct, files and versions.
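That draft-then-check loop is simple enough to sketch in a few lines of Python. This is a purely illustrative sketch: `draft_model`, `target_model`, and their methods are hypothetical stand-ins, not LM Studio's or llama.cpp's actual API.

```python
# Illustrative "guess and check" (speculative decoding) loop.
# The model objects and their methods are hypothetical stand-ins.

def generate_speculative(prompt_tokens, draft_model, target_model,
                         k=4, max_new_tokens=256):
    tokens = list(prompt_tokens)
    produced = 0
    while produced < max_new_tokens:
        # 1. The small model drafts k tokens ahead (cheap, fast).
        draft = draft_model.greedy_continue(tokens, n=k)
        # 2. The big model checks all k drafted positions in one pass
        #    (one verification pass instead of k sequential decode steps).
        accepted, correction = target_model.verify(tokens, draft)
        # 3. Keep the longest agreed-upon prefix, then take the big model's
        #    own token at the first disagreement and start another round.
        tokens.extend(accepted)
        tokens.append(correction)
        produced += len(accepted) + 1
        if correction == target_model.eos_token_id:
            break
    return tokens
```

The more often the small model's guesses are accepted, the more of the big model's work collapses into cheap verification passes, which is where the speedup comes from.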
Usually when you download these, you not only get the models, the weights, which are the giant files, but you also get these tokenizer JSON, vocab JSON, all these extra config JSON files, all these extra files that describe the model and describe the architecture of the model, as well as the tokenization language and the vocab. And if the vocab is the same between models, for example, models with the same architecture like Qwen 2.5, whether it's 14 billion or 72 billion or 32 billion, they're all going to share the same vocabulary. So, you're going to be able to use them as draft models for the big models. In the case of Llama 3.1, these are the ones that you can use as draft models. It's showing me that I can use the 70 billion parameter instruct model as a draft model for the 70 billion parameter target model that I want to run. But I can guarantee you that's not going to be fast at all. If you select something really small, like this tiny, tiny model, the Llama 3.2, which is also compatible by the way, the 1 billion parameter model, it's only 712 megabytes in size. And you run this together, then you're going to get output that's much faster. Most of the time, not always though. And that's why I wrote this tool. You see these statistics down here? These, you usually get: the tokens per second, the number of tokens, how long it takes, stop reason. But this one, 64.7% draft tokens accepted, is something you only see when speculative decoding is turned on. You can also click this visualize accepted draft tokens button, and it'll take every token that the smaller model guessed and the larger model accepted, and it'll make it green. So you can see that there's quite a lot of these that were guessed correctly by this tiny Llama 3.2 1 billion model, and then it guessed it and the bigger model says, okay, that's good, we'll go with that. So instead of the big model having to generate those tokens from scratch, all it did was just accept the smaller model's tokens, making the whole thing that much faster, two times faster. All right, so now that I explained how the draft model/target model system works, and by the way, this works in LM Studio, this works in other tools like vLLM and llama.cpp. The tool I created will support multiple, but right now it only does llama.cpp. However, it's open source and you can check it out and you can submit pull requests if you want to add something to it. It's called Draftbench. You can find it on my GitHub. And the problem that it solves is it finds which draft model works best with the right target model. If you select too small a model or a model that's too highly quantized, it might, uh, just degrade performance because it's going to give you poor results and the acceptance rate will be low. If you select a model that's too big, it's going to slow things down. So you want to have that just right, and this solution helps you find that sweet spot. So you provide a list of target models and a list of draft models, and this benchmark will do a sweep across all those, and it'll do the combinations that you want to find the best solution. So you don't have to do it manually. This takes you through the steps. You build llama.cpp. I've shown how to do this before, especially for members of the channel. Thank you, members. Members of the channel get extra videos where I go into a little bit more detail about certain things. So, you can follow these instructions.
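Since compatibility comes down to the tokenizer and vocab files described above, one rough way to sanity-check a candidate draft/target pair is to compare their vocabularies directly. This sketch assumes the Hugging Face transformers library is installed; the repo names are only examples.

```python
# Rough vocab-compatibility check between a target model and a draft candidate.
# Repo names below are examples; swap in the models you actually plan to pair.
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
draft_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# The same token-to-id mapping means the draft's guesses are meaningful to the target.
print("shared vocabulary:", target_tok.get_vocab() == draft_tok.get_vocab())
```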
By the way, if you are a member and you want to see a detailed rundown of how to run this thing, let me know in the comments down below, and also include your special emojis that members can use. But here is the code. This is the benchmark code. It does a couple of prompts that are optimized to show you the best results for speculative decoding. So, it does three prompts for each one and does an average at the end. Overall, this takes many, many hours to run. You probably want to run it overnight, but it also depends on how many combinations you want to run. So, here's an example of a config you can set up where you have your target models. In my case, I have a separate file for the 72 billion parameter models. And I have the 72 billion Q8 here and the Q4_K_M that I tested. And these are all the draft models that I tested with that one. Anything that's basically smaller, up to 7 billion. I didn't do 14 or 32. That was kind of pointless. But 14 and 32 are good candidates for target models, I thought. Because what if you're not running it on a MacBook Pro with 128 gigs of memory or a Mac Studio with 512 gigs of memory? What if you're running it on a MacBook Air? Then you can probably get away with running a 14 billion parameter model that's a Q4_K_M or a Q4_0, and then a smaller draft model. But again, which one? So let's see some of the results here. Here are the results for the 14 billion parameter models, and I did four targets: FP16, Q8_0, Q4_K_M, and Q4_0. And as you can see, speculative decoding using the draft model helps with all of these across the board, but some of them it helps more than others. So, some of you might think, oh, why would I run the FP16 version of the 14 billion parameter model when I can run a quantized version like the Q8_0 or the Q4_K_M? Well, because when you quantize a model, it usually degrades some of the quality of the result that you get. So, the less quantized the model is, technically, the better. Not always, but most of the time it is. So, the higher quality will be like the FP16 quant or the Q8 quant, and then the Q4s will follow. And if you take a look at the baseline with no draft model, this is just how it runs by itself: 22 tokens per second for this 14 billion parameter model in FP16, which is going to be pretty slow. Then if you go to Q8, it's going to be a little faster, 38 tokens per second. And if you go to Q4_K_M, 55 tokens per second. And Q4_0 will give you the fastest time, 58 tokens per second. So you think, oh, I'm always going to run the Q4 because that's the fastest. Not so fast. I mean, you know what I mean. Look what happens when you add a draft model. Suddenly running the FP16 version of the 14 billion parameter model is not that much slower than the Q4 version. FP16 with the best 1.5 billion draft. So using a draft model that's 1.5 billion, but there's a lot of different quantizations of that. The best quantization, which is what this chart shows, is bringing you up to almost 72 tokens per second from 22. That's crazy. That's a 216% jump. And if you compare that with the 14 billion Q4 and you use the best draft model for that, that's only 79 tokens per second. So we're not that far off. And we can use the FP16 version of the 14 billion parameter model and get higher quality results. So that's what this chart shows. This next chart shows the speedup versus baseline percentage. And you can see how much more benefit the draft models give for an FP16 quant of that 14 billion parameter model: 216%, 213%, 181%. And then for the Q8 we're getting 116%. We're still doing really good for the Q8.
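For what it's worth, the "speedup versus baseline" percentages in these charts appear to be the relative change in tokens per second compared with running the target model alone. A small sketch of that arithmetic, using the 72 billion numbers quoted earlier (8.7 tokens per second with no draft, 27.6 with the best 1.5 billion draft):

```python
# Assumed definition of "speedup vs baseline": relative change in tokens/sec
# compared with running the target model with no draft model at all.
def speedup_pct(baseline_tps: float, with_draft_tps: float) -> float:
    return (with_draft_tps / baseline_tps - 1.0) * 100

# 72B Q8_0 on the M3 Ultra: 8.7 tok/s alone vs 27.6 tok/s with the best 1.5B draft.
print(f"{speedup_pct(8.7, 27.6):+.0f}%")  # -> roughly +217%
```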
And the Q4? We're still getting a benefit even in the Q4, but not as much of a benefit as the Q8 or the FP16 quants. Finally, down here, you'll see a heat map. This heat map will show you the target model at the top. Obviously, you want to go green. You don't want to go red. Right down here in the very bottom right corner is a speedup of only 2.7%. It's still a speedup. Ah, here's a slowdown. This one is no good at all: a 17.4% slowdown if you're using Q4 and a draft model of 3 billion FP16. No good. So obviously you want to be somewhere up here. Let's take a look at some other ones. If you're running a 7 billion parameter model as your target model, you still get benefits. The people that are going to be running this are going to be running it on really small machines, like, I don't know, a machine that has maybe 8 gigs of VRAM or 16 gigs of unified memory on Apple silicon, for example. So, I ran Q2_K and Q3_K_M, but I don't recommend those models at all. Like, you're going to be losing way too much quality by using target models that are quantized so highly. So, don't do that. But here, 7 billion FP16, pretty good little bump you get there. Speedup versus baseline: obviously, the FP16 gets the most improvement there. Then the Q8. Q6, not so much. So, I'd skip that one. Q5 also not so much. Q5_0? Not bad, using this, uh, half a billion Q8 as the draft model. Here's the whole lineup. I'll probably make these charts available as well as part of the repository in the results folder. Here's a 32 billion target model. A pretty decent model. Very capable. But look, you can use this tool on any model really, any combination, as long as they share the same vocab. Take a look at that. And finally, 72 billion. We already went over that, but we didn't go over Q8 versus Q4_K_M. And there's a 72 billion parameter full results heat map, which gets benefits from a draft model pretty much all across the board. Even the 7 billion parameter models help it out quite a bit. I was surprised that the best improvement came not from a 7 billion parameter model, but from a 1.5 billion parameter draft. Now, besides running it on the M3 Ultra, I also ran this on the MacBook Pro. And here is just one example of the results. And it's pretty consistent: 72 billion parameters, we're getting about 200% improvement for the Q8 model with the 1.5 billion parameter draft. Yeah, it's about the same, uh, kind of heat map as well. So anyway, go check this out, play around with it, use it on your own models, and let me know how it goes. Also, you might really enjoy this video next. Speculative decoding, son of a... Yeah.
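For anyone who wants to try a sweep like this on their own models, here is a rough sketch of the general shape of the idea: list target models, list smaller draft candidates, and benchmark every pair. This is not Draftbench's actual config format or code, just an illustration.

```python
# Hypothetical sweep setup; NOT Draftbench's real config format or API.
sweep = {
    "targets": [
        "qwen2.5-72b-instruct-q8_0.gguf",
        "qwen2.5-72b-instruct-q4_k_m.gguf",
    ],
    "drafts": [
        "qwen2.5-0.5b-instruct-q8_0.gguf",
        "qwen2.5-1.5b-instruct-q8_0.gguf",
        "qwen2.5-3b-instruct-q4_k_m.gguf",
        "qwen2.5-7b-instruct-q4_k_m.gguf",
    ],
    "prompts_per_pair": 3,  # the video's benchmark averages three prompts per combination
}

results = {}
for target in sweep["targets"]:
    for draft in sweep["drafts"]:
        # Here you would launch llama.cpp (or another backend) with this
        # target/draft pair, run the prompts, and record tokens per second.
        results[(target, draft)] = None  # placeholder for the measured tok/s
```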
Video description
Stop wasting your hardware—here is how to 2x or 3x your local LLM performance
🔥Click this link https://boot.dev/?promo=ALEXZISKIND and use my code ALEXZISKIND to get 25% off your first payment for boot.dev.

🛒 Gear Links 🛒
🪛🪛Highly rated precision driver kit: https://amzn.to/4fkMVfg
💻☕ Favorite 15" display with magnet: https://amzn.to/3zD1DhQ
🎧⚡ Great 40Gbps T4 enclosure: https://amzn.to/3JNwBGW
🛠️🚀 My nvme ssd: https://amzn.to/3YLEySo
📦🎮 My gear: https://www.amazon.com/shop/alexziskind

🎥 Related Videos 🎥
🧳🧰 Mini PC portable setup - https://youtu.be/4RYmsrarOSw
🍎💻 Dev setup on Mac - https://youtu.be/KiKUN4i1SeU
💸🧠 Cheap mini runs a 70B LLM 🤯 - https://youtu.be/xyKEQjUzfAk
🧪🔥 RAM torture test on Mac - https://youtu.be/l3zIwPgan7M
🍏⚡ FREE Local LLMs on Apple Silicon | FAST! - https://youtu.be/bp2eev21Qfo
🧠📉 REALITY vs Apple’s Memory Claims | vs RTX4090m - https://youtu.be/fdvzQAWXU7A
🧬🐍 Set up Conda - https://youtu.be/2Acht_5_HTo
⚡💥 Thunderbolt 5 BREAKS Apple’s Upcharge - https://youtu.be/nHqrvxcRc7o
🧠🚀 INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k
🧱🖥️ Mac Mini Cluster - https://youtu.be/GBR6pHZ68Ho
* 🛠️ Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX
🔗 AI for Coding Playlist: 📚 - https://www.youtube.com/playlist?list=PLPwbI_iIX3aSlUmRtYPfbQHt4n0YaX0qw

draftbench GitHub Repo: https://github.com/alexziskind1/draftbench

— — — — — — — — —
❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
Click here to subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1
— — — — — — — — —
Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join
— — — — — — — — —
📱 ALEX on X: https://x.com/digitalix

#macstudio #llm #claudecode