Zen van Riel · 63.7K views · 2.4K likes
Analysis Summary
Ask yourself: “Did I notice what this video wanted from me, and did I decide freely to say yes?”
Worth Noting
Positive elements
- This video provides highly specific, practical troubleshooting for connecting Claude Code to local LLMs, particularly regarding context window limits and system prompts.
Be Aware
Cautionary elements
- The use of a future date (2026) and high-end hardware creates an artificial sense of 'cutting edge' exclusivity designed to funnel viewers into a paid career community.
Influence Dimensions
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Transcript
You'll learn the best local AI coding workflow for 2026. In this video, we will be using the latest and greatest Qwen models, routing our local models through Claude Code, and even using any AI model that you want on weak laptops locally using LM Studio Link. You don't want to miss out on this. So, let's get right into it. Welcome to my local setup. This is my Linux machine with my RTX 5090 with 32 GB of VRAM. I'm going to be using a couple of models throughout this video, and the first one is a new Qwen 3.5 model with 35 billion parameters. You can see that my GPU blazes through this Python code at more than 100 tokens per second, and this is because, while this model is quite big, not all of the parameters are active when you're asking a question. This is the benefit of a mixture-of-experts model, which is very common in modern local AI systems. So you can even see it hit 140 tokens a second, and it's already done with this question. What's important to realize here is that if you cannot fit the entire model on your GPU, as I'm going to show now, the performance of the AI model is going to be much worse. In this case, I'm actually loading some of the parameters of the model into my system RAM instead, and that's going to lead to a lot of data having to be transported back and forth, so the performance will be much poorer. Just because you can fit a model on your system by putting some of the parameters in your system RAM doesn't mean that it's actually going to be usable in practice. Especially for agentic coding, you're going to be using very big context windows, where the compute cost scales roughly quadratically with context length. So you really have to experiment and see which model can truly fit on your GPU at acceptable speeds if you want to code proper solutions with it. Next, I want to expose this local model to my MacBook, which is my main development environment.
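The "does it fit on the GPU" question above can be sanity-checked with a back-of-envelope calculation. This is only a sketch: the 4.5 bits/weight quantization figure and the flat overhead allowance are assumptions, not measurements of any specific model or runtime.

```python
def model_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                  overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: quantized weights plus a flat allowance for
    KV cache and runtime buffers. All numbers are rules of thumb."""
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params * (bits/8) bytes
    return weights_gb + overhead_gb

# A ~35B model at ~4.5 bits/weight needs roughly 22 GB, so it can fit on a
# 32 GB card; the same model at 16 bits (~70 GB of weights alone) would
# spill into system RAM and crawl.
print(round(model_vram_gb(35), 1))
print(round(model_vram_gb(35, bits_per_weight=16), 1))
```

The point of the estimate is the binary decision the video describes: if the number comes out above your VRAM, layers get offloaded to system RAM and token throughput collapses.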
I could be running a similar model there, but it would be much slower compared to this GPU. So, we're going to be using LM Studio's new linking functionality to expose an encrypted connection between two devices, so I can effectively run this model, quote unquote, locally on the MacBook. It's very easy to set up. All I have to do is open LM Studio on these two different devices. I've logged into LM Studio on the Ubuntu device, and now we're going to hop over to the MacBook and open it up here. Here on the MacBook, I can browse to the linking functionality, and I will see that Ubuntu machine pop up immediately. And indeed, the Qwen 3 Coder model is already shown as loaded in here. It's going to be super seamless to ask a question to this local model. Now, I can just select it in here as a linked model, open a new chat, and ask it to generate some Python code. So now you can see that the model is available as a linked model from my Ubuntu machine. And just to prove that it's running locally, here on the top right you can see that my GPU is starting to spike up, and it's only doing it for a short moment because I just asked it to generate a very simple Python script. But we do have this connection set up properly now, which is very nice. So what's next? Well, LM Studio is a nice chat interface, but it's not really a good interface for coding complex solutions. So, I'm going to be connecting Claude Code to LM Studio, and the first step is to enable the local server so that I can point Claude Code to it, because for a couple of months now, you've been able to connect Claude Code to LLMs of your choice. You don't have to rely on the models by Anthropic anymore. Now, I don't know how to do this off the top of my head, so in this chat, I was basically asking Claude Code to research how to change its own settings so that it could point itself to the LM Studio API.
Now, it's good to know that LM Studio exposes an API that has multiple endpoints. There is an API that's compatible with the OpenAI API standard, but there's also a specific one that's compatible with the Anthropic one, which is probably the easiest one to use here because Claude Code expects that. If I go to LM Studio, you can see those different supported endpoints. We've got a chat interface, which is the LM Studio API, but there's an OpenAI-compatible endpoint as well. More importantly, there is also an Anthropic-compatible endpoint, v1/messages, which is the one that we want. So I can tell Claude Code that we have that endpoint available, so it can give us the right recommendation for the command to connect to it. While the AI is thinking, I want to make sure that you've already subscribed to this channel, because most of the people watching my channel are not subscribed, and if you don't subscribe, you will miss out on the latest in AI engineering. So make sure to click the button below. After a little while, it asks us to export two environment variables to override the Anthropic base URL and API key. There might be many other ways to get this done, but this will work for me. So I'm overriding the Anthropic base URL and key now and just saying hello to Claude over the command line. And you can indeed see that that command is now being sent to my local GPU. It's actually taking a while to respond. The reason for this is that Claude Code injects quite a lot of context into its system prompt, and it's very easy to miss this detail and think that everything is going to be as fast as an empty chat, but that's not true at all. This is what a lot of YouTube videos are actually missing: your AI model will be much slower when you connect it through Claude Code. It's not really a free lunch.
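The two environment-variable overrides described above can be sketched as follows. The port is an assumption: LM Studio's local server defaults to port 1234, so adjust the host and port to wherever your server actually listens (for example, the Linux machine's address when linking remotely), and the key value is a placeholder since the local server does not check it.

```shell
# Point Claude Code at LM Studio's Anthropic-compatible endpoint.
export ANTHROPIC_BASE_URL="http://localhost:1234"   # LM Studio's default local server port
export ANTHROPIC_API_KEY="lm-studio"                # placeholder; the local server ignores it
echo "Requests will go to: $ANTHROPIC_BASE_URL/v1/messages"
# Then launch Claude Code from this same shell session, e.g.:
# claude
```

Because these are plain environment variables, they only affect Claude Code sessions started from the same shell (or from a shell that sources them), which is why the video later adds them to the terminal before launching the regular UI.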
You can see right here that it takes a long time to process the prompt, because Claude Code simply gives the model a huge system prompt with all kinds of directives on how to code properly. With all these videos promoting Claude Code with local models, I feel like most of the people promoting this are not using it themselves, because unless you have a very powerful machine, this is going to be extremely slow as your repository grows in size. It's simply one of the limitations of local AI coding. Regardless, you're able to customize this prompt and trim some things out of it, or use a different CLI provider that has a leaner prompt. But in my case, it just takes a while, and now it's finally starting to generate that answer. Again, though, I'm using a coding model that doesn't fit entirely on my GPU, so we're going to optimize that later. For now, you can see that we get that response: "How can I help you?" It took a long time to get there, but this is because of the context window being filled by the system prompt of Claude Code. Next, we want to optimize this a little bit, because we're obviously not going to be able to code if it takes two minutes to get any kind of response. So now we're switching to the Qwen 3.5 model, which is not specifically made for programming, but it's still a very competent language model, so it will do a pretty good job, and it will fit on my GPU entirely. Now, I'm making one mistake on purpose: I'm using the default settings of LM Studio, with only a 4,000-token context window. You will see that if I try this request again, it will hang indefinitely, because the Claude Code system prompt is thousands of tokens long. We are hitting the limit of my local model immediately, and there's no clear error message indicating this.
So this is another tip to look out for when you are combining CLI tools that expect a huge context window and you haven't set one up properly yourself. Now we're going to increase the context window to something closer to 80,000 tokens. This is also necessary because, as you ingest a lot of files to answer code questions or to come up with new API endpoints, you need a long context window. And now you will see that it's actually responding a lot faster. But one thing that's a little bit weird about this answer is that it says that it's Sonnet. How come? Well, again, Claude Code is injecting its system prompt into the language model, and even though we're using a Qwen model, because that system prompt says that it's Claude, it believes this as well. It's another very important thing to realize about language models: they don't always have self-awareness of the model that they actually are. The system prompt that they are fed really dictates their behavior. Now, if I add these environment variables to my terminal, I can launch the regular Claude Code UI and it will use my local AI model. We're getting a bit of an auth conflict here, but for now, it's fine to ignore that. So, we can finally start building something. To test my model in detail, we're going to build a bit of a full-stack application to interact with the LM Studio API. Why not? So, we can just say: plan out a sample repo that has a Next.js TypeScript project to showcase your ability to create a full-stack app. Be a little bit creative with the concept, and don't just recommend a lame to-do list app, because we have seen a million of those already. In fact, we are showcasing LM Studio's ability to share models between PCs, so it would be nice if it can mimic their UI that shows the health of the server with loaded models, by exploring the API available in LM Studio.
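The silent-hang pitfall above comes down to simple arithmetic: system prompt plus message plus reply space must fit the configured window. A rough pre-flight check can be sketched like this, assuming the crude ~4-characters-per-token heuristic for English text (real tokenizers differ, so treat this as an estimate only):

```python
def fits_in_context(system_prompt: str, user_msg: str,
                    context_tokens: int, reply_budget: int = 1024,
                    chars_per_token: float = 4.0) -> bool:
    """Estimate the request's token count (~4 chars/token heuristic) and
    check it fits the model's window with room left for a reply."""
    estimated = (len(system_prompt) + len(user_msg)) / chars_per_token
    return estimated + reply_budget <= context_tokens

# A Claude-Code-sized system prompt (thousands of tokens) overflows the
# 4,096-token LM Studio default but fits easily in an 80k window.
big_system = "x" * 40_000   # ~10k tokens by the heuristic
print(fits_in_context(big_system, "hello", 4_096))    # False
print(fits_in_context(big_system, "hello", 80_000))   # True
```

Running a check like this before pointing a heavyweight CLI at a small local window would surface exactly the failure the video demonstrates, instead of an indefinite hang.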
I'm just going to paste in the documentation of the API, because I'm just searching here for the right endpoint. There's a REST API document that you can get to. There we go: open the documentation. I could manually copy-paste this, but there's a simple "copy as markdown" button. So, I'll paste in the entire description just to ground the model in the API of LM Studio, because it probably doesn't know it. Now we're going to go ahead and use plan mode with my local model. You can see it actually starts to respond pretty quickly, because with everything I've set up now, I've optimized it to run on my 5090 directly, and it's able to take care of quick responses. So, it's just going to explore the codebase, which is not too exciting because there's nothing in the codebase as of yet, and then it's going to create the plan based on all of that. Now it starts to ask me questions. This shows you that the local model, even though it's not, you know, Claude Opus, the latest and greatest, actually handles tool calling pretty well, because now it's asking me a couple of questions, like the primary focus of this demo application. I'm just going to say that it should be a simple dashboard as a proof of concept, and the Next.js back end will sit in the middle as a proxy. We could just have a simple HTML page that interacts with the API directly, but I want to prove that this system can build a full-stack app. So we're going to have the Next.js back end pass requests from the front end to the LM Studio API; hence, I'm just going to call it a proxy for now. In terms of interactivity, well, I want to keep this simple: we're just going to have a simple connection to the LM Studio API, and it's just going to be, you know, a simple dashboard. And in terms of the AI/ML component, we're just going to leave that out for now and keep it simple.
Now, after a bit of planning, you can see here that I've used 45,000 tokens out of my 200,000 tokens, but that doesn't really represent the real local AI model, because this is just Claude Code: it thinks I'm using a Claude Sonnet model. So, this might not reflect the true maximum number of tokens, depending on the local model configuration you have, but it is nice to see how many of the tokens are being used by, for example, the system prompt, which is indeed already 3,000 tokens, as well as all the messages you've sent so far. That way you know when you may have to summarize a conversation, or clear out and start with a fresh conversation history. After a while, just time-skipping here, we've got seven different spec files that we can implement in order to build out this full-stack dashboard. So we're really working through this entire agentic flow, where we first create our specs, then have the AI agent work them out, and we should end up with a pretty nice end result. Just checking up on some of the code samples it's writing, you can see that there is some scaffolding code here, where we're going to have a back end that calls that v1 API on the LM Studio side. After all of this planning, we ended up using 65,000 tokens. So what happens when I try to fill the context window? Well, I'm just pasting a bunch of extra stuff in here to show you that it is still able to respond, no problem at all. The moment I do this, you will actually see different behaviors depending on how you configured LM Studio, because you can configure what happens when the context window of the model has been hit fully. For example, in this case, what I've done in my settings is set the context overflow to truncate the middle of the conversation history.
This keeps some of that initial conversation history, where you explore the codebase, but it basically gets rid of a lot of what happened in the middle of the conversation, which does of course reduce the memory of your LLM, but it frees up space for you to continue the conversation. Sometimes, though, Claude Code will take care of this on its own, and sometimes it will proactively summarize the conversation history. It's just good to be aware that there are different ways to go about compressing your conversation history so you can keep chatting even with a limited context window. Next, we want to implement the full solution. I like to use the bypass-all-permissions mode for Claude Code so I don't have to press, you know, enter for every single small change. The way I'm going to do that safely is to run inside of a dev container. I've got many videos explaining how that works; it basically isolates my development environment so I'm able to run Claude Code in bypass-all-permissions mode. And of course, I'm now going to set my model's context window to 200,000 tokens. The main reason I'm doing it this way is that I don't mind the decrease in speed. I have bypass-all-permissions mode on, so I can just walk away from my PC, come back later, and it's totally fine if it takes 20 minutes longer to work out this full-stack application. Now, I'm going to ask it to work out each spec sequentially. One thing that's important to note is that I'm explicitly asking it to create sub-agents for each task. This means it will create a new instance of Claude Code with a fresh context window to work on one piece of work and then report back to the main agent. This way I'm able to get much more out of the limited context window that I might have for a local model. So I definitely recommend working with sub-agents more than ever if you're doing local AI coding.
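The "truncate the middle" overflow policy described above can be sketched in a few lines. This is an illustration of the behavior, not LM Studio's actual implementation: keep the earliest messages (for example, the initial codebase exploration) and the most recent ones, and drop the middle of the history.

```python
def truncate_middle(messages: list, max_messages: int, keep_head: int = 2) -> list:
    """Sketch of a 'truncate middle' context-overflow policy: preserve the
    start and the end of the conversation, discard the middle."""
    if len(messages) <= max_messages:
        return messages          # nothing to do, history still fits
    keep_tail = max_messages - keep_head
    return messages[:keep_head] + messages[-keep_tail:]

history = [f"msg{i}" for i in range(10)]
print(truncate_middle(history, 5))   # ['msg0', 'msg1', 'msg7', 'msg8', 'msg9']
```

The trade-off is exactly the one the video names: the agent loses memory of the dropped middle turns, but the conversation can continue inside a fixed window instead of hanging.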
So after a while, 30 minutes or so to be exact, we have a dashboard that seems to be working, but I ran into quite a few bugs. And to be honest, there was some hard-coded information here, like this Nvidia RTX 3080 GPU supposedly being used. That's just made up on the spot by the LLM. It's pretty typical, right? If you don't specify everything in detail, it's going to make things up. But for the purpose of this demonstration, I want the real models that are loaded in LM Studio to be shown. To get that done, I had to pass in more documentation about how the API worked and get it to fix the code it had written so far. To be honest, this is the same with state-of-the-art models: you have to keep coding, you have to fix bugs. But it is good to be aware that local models simply aren't as good as what you get from, for example, the latest Claude Opus model. So, you do have to be realistic and accept that you're probably going to get more bugs, simply because your models aren't as strong. One great way to fix bugs is to make sure that the AI agent is able to directly call the backend APIs you're trying to integrate with, because that way it's able to self-assess whether it's calling the APIs properly. So in this case I'm giving it instructions for how it can call the LM Studio API on its own, so it can align the output format of the API with the code that it's writing, making it much more accurate and bug-free. Given some extra time, you can now see that we have a nice overview of the models, and it indeed also knows that the Qwen 3.5 model has been installed. Even here, we can see a couple of weird hard-coded details, like this 256k context window, which is the maximum context window for that model, but not the actual limit that I configured.
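The self-assessment idea above amounts to comparing the API's real responses against the shape the generated frontend code expects. A minimal sketch of such a check, where the expected keys (`data`, `id`) are illustrative placeholders and not LM Studio's actual schema:

```python
def check_models_payload(payload: dict) -> list[str]:
    """Return a list of mismatches between an API response and the shape
    the frontend expects (empty list means the shapes agree)."""
    models = payload.get("data")
    if not isinstance(models, list):
        return ["response has no 'data' list"]
    problems = []
    for i, model in enumerate(models):
        if not isinstance(model, dict) or "id" not in model:
            problems.append(f"entry {i} is missing an 'id' field")
    return problems

# One valid entry, one that would have been hallucinated or hard-coded:
print(check_models_payload({"data": [{"id": "qwen-3.5"}, {"name": "oops"}]}))
# → ["entry 1 is missing an 'id' field"]
```

Giving the agent a check like this (or just letting it curl the endpoint and read the raw JSON) catches invented fields like the hard-coded RTX 3080 before they reach the UI.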
But even so, you can see here that we're still working on some of the endpoints; at the very least, the models one is returning a valid response from the server. Granted, we still have some work to do here, but clearly we're able to create a real full-stack application using a local model connected via Claude Code. And in fact, that model is not even running on my MacBook; it's running on the Linux machine using the Link feature of LM Studio. So I hope that you enjoyed this new way to work with local models, and you should definitely try this workflow for yourself, because it is much more powerful than what was possible two years ago. It's still not the same as using the best state-of-the-art cloud models, but if you are a privacy enthusiast, you should definitely get into this, because local AI coding has never been better. If you want to learn more like this, definitely subscribe to the channel, but also check out my AI engineering community in the link in the description below and sign up for my free resources to learn
Video description
🎁 Get my FREE local AI projects: https://zenvanriel.com/open-source
⚡ Become a high-earning AI engineer: https://aiengineer.community/join

Run Claude Code with local AI models using LM Studio and a powerful GPU, no cloud API keys needed. In this video, I walk through my complete local AI coding setup: running Qwen 3.5 on an RTX 5090, linking my Linux GPU machine to my MacBook using LM Studio Link, and connecting Claude Code to local models through LM Studio's Anthropic-compatible API endpoint. Then I build a full-stack Next.js dashboard from scratch using only local AI to prove it actually works, and I share the real limitations most YouTubers won't tell you about.

Sources & Documentation
- Claude Code LLM Gateway Configuration: https://code.claude.com/docs/en/llm-gateway
- Use Your LM Studio Models in Claude Code (LM Studio Blog): https://lmstudio.ai/blog/claudecode
- LM Link — Use Your Local Models Remotely (Docs): https://lmstudio.ai/docs/lmlink
- LM Link Product Page: https://lmstudio.ai/link
- LM Studio Developer Docs (Local Server & API): https://lmstudio.ai/docs/developer

What You'll Learn
- How to run local AI models on your GPU for coding (Qwen 3.5 35B on RTX 5090)
- Why GPU offloading vs system RAM makes a huge difference in local AI speed
- How to use LM Studio Link to share models between machines (Linux GPU → MacBook)
- How to connect Claude Code to local models via LM Studio's Anthropic-compatible endpoint
- Why Claude Code's system prompt makes local models much slower than empty chat
- The context window pitfall: why the default 4K tokens hangs Claude Code and how to fix it
- Building a real full-stack Next.js app with Claude Code and local AI models
- How to use sub-agents in Claude Code to maximize a limited local context window
- Running Claude Code in bypass-all-permissions mode safely with dev containers
- Honest comparison: local AI coding vs cloud models — bugs, speed, and trade-offs

Timestamps
0:19 My Local AI Linux Machine
1:43 Expose LLM to weak laptop (LM Studio Link)
3:01 Connecting Claude Code to local model
6:16 Optimize LLM Parameters Locally
7:58 Scoping a full-stack app
8:59 Grounding local AI in documentation
10:41 Keeping track of context
12:09 Context overflow settings
13:09 200k Context Window Build
15:17 The Final App Result

Why I Made This Video
Everyone is promoting Claude Code with local models, but most of them aren't actually using it for real projects. I wanted to show the full workflow end-to-end — from setting up your GPU to building a real full-stack app, including the pitfalls, slowdowns, and honest limitations that other videos leave out.

#localai #claudecode #lmstudio #localllm #aicoding #localcoding #qwen #localaicode #localclaudecode #aiengineering #gpu #selfhostedai #privacyai #lmlink #localcodeassistant

Connect
LinkedIn: https://www.linkedin.com/in/zen-van-riel
Community: https://www.skool.com/ai-engineer
Sponsorships & Business Inquiries: business@aiengineer.community