Analysis Summary
Ask yourself: “Whose perspective is missing here, and would the story change if they were included?”
Curiosity gap
Creating a deliberate gap between what you know and what you want to know, triggering curiosity as an almost physical itch. Headlines like "You won't believe..." are engineered to exploit this. The content rarely delivers on the promise.
Loewenstein's Information Gap Theory (1994)
Worth Noting
Positive elements
- This video provides a practical look at how specialized NPUs and PowerInfer software can offload heavy LLM workloads from underpowered host machines.
Be Aware
Cautionary elements
- The use of an extreme 'underdog' computer (8GB RAM) as the benchmark makes the product's performance seem like a unique technical miracle rather than a standard hardware trade-off.
Influence Dimensions
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Related content covering similar topics.
Mac Studio vs. Nvidia: Running Large Models Locally!
Heavy Metal Cloud
Your Mac Has Hidden VRAM… Here's How to Unlock It
Alex Ziskind
Is Bigger Better? EVERY Qwen 3.5 Local AI Compared - 397B vs 122B vs 35B vs 27B 🧐
xCreate
MacBook Pro M5 Max: What Apple Is Really Building
Bobby Tonelli
Transcript
If you want to carry a local and private 120 billion parameter large language model, you might need something like this: big, expensive GPUs and computers. But what if I told you that 120 billion parameters can fit in your pocket? The first time I heard of something like that, Steve Jobs was putting a thousand songs in my pocket. Well, this is a slightly nerdier version of that. And no, I was not happy about Steve Jobs putting stuff in my pocket. Stay out of my pocket, Steve. For the past few years, AI hardware has been moving in one direction: bigger GPUs, bigger servers, bigger clusters. And if you watch this channel, you know I've built a few of those. Now, something very different is starting to show up. This is the Tiny AI Pocket Lab, and according to the company, this thing can run models up to 120 billion parameters locally. That's a wild claim, so I need to check it out. And it matters because one of the best-known 120 billion parameter models right now is still GPT-OSS 120B. It's not the newest model, but it's well known because it's OpenAI's first open-weight model. OpenAI says GPT-OSS 120B is best with at least 60 GB of VRAM, and that it runs more efficiently on a single 80 GB GPU. That means I won't even be able to run it on my expensive RTX 5090, which only has 32 GB of VRAM. So I want to check out this Tiny AI Pocket Lab with this MacBook Neo that only has 8 GB of memory. Just to give you an example, this is a 4 billion parameter model running on the MacBook and getting about 9 tokens per second. So let's find out whether this little box can actually pull this off.

First of all, this thing really lives up to its name. This is my iPhone 16. This is this thing. I also have to weigh it. My iPhone with the case weighs 228 kilograms... excuse me, 228 grams. I could have guessed that. And Tiny weighs 305. The reason I'm pairing it up with this MacBook Neo is that the MacBook only has 8 gigs of memory; there's no way you'll be able to run anything bigger than a 4 billion parameter model on it. Or maybe I'll do a separate video on that. But one of the biggest things about running large language models locally is space. There's a 256 GB SSD here, and I'm already almost out of space with just software development tools on it. Imagine loading LLMs. This thing comes with 1 terabyte of SSD space, so you'll be able to load up a bunch of models. It also has its own CPU, and of course it has the memory, 80 GB of it, so it can run larger models.

Now, to get going with this, I'm going to connect it with a USB-C cable to the port labeled "PC." It operates at up to 60 watts, with a 30-watt TDP, so you can actually power it externally by plugging a USB adapter into the power port. I'm going to power it on. There's a little light right there that shows me it's on. It shows up as a drive, and I just need to double-click on this to set up the software, Tiny OS. Pop that open, and after login you get straight to the chat screen. And right there, I'm already chatting with this thing. I have chat available, I have agents available, I have an agent store, and all these different models that I can try. There's the GPT-OSS 120B that I was talking about. Let's click on that and try it out. And there it goes. Let's give it a slightly longer prompt just so you get a sense of how fast it's going on a 120 billion parameter model. This, by the way, is the default chat interface.
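For a rough sense of why that 60 GB VRAM figure lines up with a 120B model, here's a back-of-envelope sketch. The numbers are illustrative, not vendor specs, and it ignores KV cache and activation memory:

```python
# Rough back-of-envelope weight-memory sizing for an LLM.
# Ignores KV cache and activations; figures are illustrative only.
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a given quantization."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"120B @ {bits}-bit ~= {weights_gb(120, bits):.0f} GB")
# 16-bit ~= 240 GB, 8-bit ~= 120 GB, 4-bit ~= 60 GB; the "at least 60 GB
# of VRAM" guidance matches a roughly 4-bit quantized build.
```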
Now, people watching this channel might be interested in the fact that they have an SDK, not just the chat interface. That way, you can interact with the models programmatically. Pip install tiny SDK. Boom. That's it. Now I can use this in my programs: load up a device, load up the models in my Python code. Or I can use this right from the terminal, so I can do tiny run openai GPT-OSS-120B, and now I can interact with it fully through the command line. And they weren't kidding: it's 18 tokens per second for this model. This is completely offline, not connected to the internet at all. Everything is right here. You only have to connect to the internet if you want to download new models. So far, these are the ones I've already downloaded: GPT-OSS 120B, Qwen 3 30B, Qwen 3 Coder, and image creation; we'll take a look at that. Of the ones available that I haven't downloaded yet, GLM 4.7 Flash looks pretty interesting, so I'm going to click on Get. "Network is offline. Please check your connection." So before I can do anything else, I need to connect this to Wi-Fi to download the new model. There I am, downloading GLM 4.7 Flash. What's nice here is that it's not going through the machine. You might think, "Why doesn't it just use the Wi-Fi that's on this thing?" Well, that's the whole idea: it's its own device. The model goes directly to the device instead of through the computer, where you might not have enough space.

There's 120B answering me. Let's ask it to write a story. While it's thinking, you can see the approximate speed it's going at. Look at our Activity Monitor, look at the memory usage here: 5 GB. That's pretty normal. There are no spikes, no memory pressure; nothing is being used on this computer, so I can keep using it as is. All the processing is done on the Tiny Pocket Lab. All right, thank you, that's a very long story. We don't need that. Let's load up a coding model. Did my GLM Flash download yet? It's still downloading, but we do have Qwen Coder 30B. Hey camera, what are you looking at? The screen. You like what you see? I like it. Qwen Coder downloading. Got to do something while it's loading, right? New chat. And hello, and Qwen Coder is obviously much faster than the 120 billion parameter model; this one is only 30. And this is not a thinking model, so it answers right away. But why would I be interested in a coder model? Well, because this is a software dev channel. So how can we use this, beyond chatting with it, as software developers? First of all, there's a bunch of different agents you can use right here on the left: Stable Diffusion, chat memos, and an AI assistant. There are a few too many agents to go into in this video, but let me know if you want to learn more about these and maybe I'll do a follow-up. There's also an agent store, and this is obviously the kind of thing that will keep getting updates; more and more stuff is going to show up here. Now, here is the dashboard. This shows me how many tokens I've used, which models I used, and how many tokens per model. This is really handy because it gives you a little visibility and insight into how you're using your resources.
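To make the SDK and CLI steps above concrete, here is a minimal sketch of what programmatic use might look like. The video only shows this briefly, so the module name, the `Device` class, and the method names below are all assumptions, not a documented API:

```python
# Hypothetical sketch of the SDK flow shown in the video. The module name
# "tiny_sdk" and the Device / load_model / chat names are assumptions.
from tiny_sdk import Device

device = Device.connect()                          # attach to the Pocket Lab over USB-C
model = device.load_model("openai/gpt-oss-120b")   # model id as spoken in the video

# Stream a reply, mirroring what the CLI (`tiny run ...`) prints to the terminal.
for chunk in model.chat("Write a haiku about NPUs.", stream=True):
    print(chunk, end="", flush=True)
```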
For example, if you're developing a solution locally that will be deployed somewhere else, you can gauge the token use and estimate costs later if you'll be using a third-party provider. Pretty handy. Now, one thing I have not tried yet, and I'm about to; it's always scary doing this for the first time on camera. There's an API base URL and an API key that let you use Tiny through an OpenAI-compatible interface. In fact, if I take that weird-looking base URL and plug it into the browser, it tells me exactly what's going on. Here's the /models endpoint, and yes, that is an OpenAI-compatible endpoint. It tells me that Qwen 3 Coder is in fact loaded. So here I am in VS Code, and I've configured Qwen 3 Coder 30B to be my custom AI agent inside my editor. Now I can talk to it through my VS Code interface. What does this file do? "This file is a batch script for Windows that automates the setup and execution of a Python project." That's right. So it has tool calling enabled, it can read and edit files, and so on. And the speed is not terrible. Now I don't have to take my MacBook Pro on an airplane; I can travel light with this stuff. Here's an example of a smaller model, Qwen 3 8B, and that's pretty fast. Try again: write a story. Now, what's funny is that the MacBook doesn't have a fan at all, but the tiny box, the Pocket Lab, makes up for that because it has its own fan. It definitely needs to cool that NPU that's doing the processing. So the only thing I hear in this office is the Tiny Pocket Lab: just a persistent fan going. You can definitely hear it. It's not silent, but it's not wildly annoying. By the way, when you use this inside VS Code as a coding agent, you go through a lot of tokens; with just those couple of queries I made, I'm already over 20,000 tokens. Another win for local, right? Because you get an unlimited number.

GLM Flash finally downloaded, and now I can start talking to it. So now I have even more models stored on this thing that I can pull up and use anytime. Let's check out SD Web UI. A cat in a hat. Do you think it's going to draw a cat inside a hat or a cat wearing a hat? Instead, it draws a red square. It says here, "Please select a text-to-image model from the status bar." So you can only have one model loaded at a time, and right now I have GLM 4.7 Flash loaded, as you can see here, which makes that model available to all the different tools and agents. If I want a Stable Diffusion type of model to be available to the Stable Diffusion UI, I need to load up Z-Image Turbo or another text-to-image model, which unloads GLM 4.7 Flash. Makes sense, right? You have a limited amount of resources to work with, so unloading a model before loading a new one makes sense. All right, let's try generating that. And there it goes. Looks like Stable Diffusion. All right, any guesses? I just see a cat. I see zero hats in that picture. A cat playing guitar. Now, if you know a little more about Stable Diffusion than I do, you might know what all these settings do and the best options to select. But you're not me, and I don't, so I'm just going to hit Generate here. Hey, look at that, we're getting a cat playing guitar. What a cute little pussycat.
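Because the box exposes an OpenAI-compatible endpoint, any standard OpenAI client should be able to talk to it, which is exactly what the VS Code setup above relies on. Here's a minimal sketch using the official `openai` Python package; the base URL, key, and model id are placeholders, so substitute whatever Tiny OS and the /models endpoint actually report:

```python
# Minimal sketch against the box's OpenAI-compatible endpoint. The base URL,
# API key, and model id below are placeholders; use the values Tiny OS shows.
from openai import OpenAI

client = OpenAI(
    base_url="http://<pocket-lab-address>/v1",  # the "weird looking base URL" from the UI
    api_key="<key-from-the-dashboard>",
)

# Same check as pasting the /models endpoint into a browser:
for m in client.models.list():
    print(m.id)  # should list the currently loaded model

resp = client.chat.completions.create(
    model="qwen3-coder-30b",  # assumed id; use whatever /models actually reports
    messages=[{"role": "user", "content": "What does this batch file do?"}],
)
print(resp.choices[0].message.content)
```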
What's cool about this interface is that it actually tells you which models are available, which ones you've already downloaded (the ones with the check mark), and which ones you could still get that will work with Stable Diffusion. Now, one of the questions that comes to mind is: how can such a tiny little device with an NPU compete with an integrated GPU like, for example, Apple Silicon? So here's Tiny on the left running Qwen 3 8B, which is a bigger model than the one I'm running on the right; that's Qwen 3 4B. Let's take a look at how that goes. I'm going to hit enter here and enter here. They're not exactly simultaneous, but close enough to give us a sense of how quickly things are generating. Hello. There it goes. Okay, it's thinking. The one on the right is taking a little longer. You tell me which one you think is faster here; I think the NPU one is faster, even though it's running a larger model. The one executing on the Mac is actually running MLX. Now, this is the MacBook Neo, so it's not going to be super fast like an M4 or M5 Max machine, but we're talking about really low power consumption lasting a long time.

How is Tiny doing that? Well, the foundation can be found on GitHub. Tiny AI PowerInfer is the name of the repository; you can go check it out. This is kind of like the backbone of the product. Oh, look at that: GitHub's number one trending repository of the day. PowerInfer is described as a CPU/GPU LLM inference engine leveraging activation locality for your device. I don't know what the heck that means, but basically it means keeping the actively used, "hot" parts of the model constantly active and alive while putting the less commonly used parts to sleep. It's a clever way of managing the model so it can run faster. You can get more info on the tiiny.ai website as well as their Kickstarter campaign, where you can sign up and get this thing at a discounted rate if you're an early backer. Now, I've wanted to play with this thing ever since I saw it at CES, and I think it's really cool. This thing is not going to be a replacement for a monster GPU rig; it's slower, and right now it's more curated than a fully open DIY setup. That's kind of clear. But if you want a small, private, low-power box that can bring meaningful local AI to a much weaker laptop, or perhaps a less capable mini PC, that's where this starts to make a lot of sense. And yes, being on Kickstarter makes it a harder sell for some people, which is totally fair, but the idea here is genuinely interesting, and in actual use I think it makes a stronger case than you might expect. Let me know your thoughts in the comments down below; I look forward to reading them. I do read all your comments, and I think you would like this video next. Thanks for watching, and I'll see you next time.
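For the curious: "activation locality" refers to the observation that a small, fairly stable subset of a model's neurons fires on most tokens. Here's a toy sketch of that idea, emphatically not PowerInfer's actual implementation: keep the frequently firing "hot" rows of a weight matrix in fast memory and compute "cold" rows only when a predictor says they will activate.

```python
# Toy illustration of activation locality (NOT PowerInfer's real code):
# hot neurons stay resident and are always computed; cold neurons are
# computed only when predicted to fire.
import numpy as np

rng = np.random.default_rng(0)
n = 4096
W = rng.standard_normal((n, n)).astype(np.float32)  # one FFN weight matrix

# Pretend we profiled per-neuron activation frequency on sample prompts.
activation_counts = rng.poisson(lam=3.0, size=n)
hot = np.argsort(activation_counts)[-n // 4:]        # top 25% -> fast memory
cold = np.setdiff1d(np.arange(n), hot)               # the rest -> slow memory

def sparse_ffn(x: np.ndarray) -> np.ndarray:
    out = np.zeros(n, dtype=np.float32)
    out[hot] = W[hot] @ x                            # hot rows: always computed
    fire = rng.random(cold.size) < 0.1               # stand-in activation predictor
    out[cold[fire]] = W[cold[fire]] @ x              # cold rows: only if predicted to fire
    return np.maximum(out, 0.0)                      # ReLU zeroes most cold outputs anyway

y = sparse_ffn(rng.standard_normal(n).astype(np.float32))
print(f"{np.count_nonzero(y)} of {n} outputs nonzero")
```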
Video description
I paired a tiny AI box with the MacBook Neo—and it seriously changed what I thought was possible with local AI.
Tiiny box: https://tiiny.ai
👀 My favorite external drive (dependable): https://amzn.to/3Os9Wi3
👀 Thunderbolt 4 dock: https://amzn.to/3yVRicC
⚡ *Other gear I use:* https://www.amazon.com/shop/alexziskind

🎥 Related Videos 🎥
🧬🐍 Mac Studio CLUSTER vs M3 Ultra 🤯 - https://youtu.be/d8yS-2OyJhw
🧳🧰 Mini PC portable setup - https://youtu.be/4RYmsrarOSw
🍎💻 Dev setup on Mac - https://youtu.be/KiKUN4i1SeU
💸🧠 Cheap mini runs a 70B LLM 🤯 - https://youtu.be/xyKEQjUzfAk
🧪🔥 RAM torture test on Mac - https://youtu.be/l3zIwPgan7M
🍏⚡ FREE Local LLMs on Apple Silicon | FAST! - https://youtu.be/bp2eev21Qfo
🧠📉 REALITY vs Apple’s Memory Claims | vs RTX4090m - https://youtu.be/fdvzQAWXU7A
⚡💥 Thunderbolt 5 BREAKS Apple’s Upcharge - https://youtu.be/nHqrvxcRc7o
🧠🚀 INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k
🧱🖥️ Mac Mini Cluster - https://youtu.be/GBR6pHZ68Ho
🛠️ Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX

❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
Click here to subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1
Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join

📱 LET'S CONNECT ON SOCIAL MEDIA
ALEX ON TWITTER: https://twitter.com/digitalix

#macstudio #tiiny #llm