bouncer

The AI Automators · 56.8K views · 2.0K likes

Analysis Summary

20% Low Influence
mild · moderate · severe

“Be aware that the impressive demo is from their ongoing app-building series, naturally leading to an open pitch for their paid community with full access.”

Ask yourself: “Did I notice what this video wanted from me, and did I decide freely to say yes?”

Transparency: Transparent
Primary technique

Direct appeal

Explicitly telling you what to do — subscribe, donate, vote, share. Unlike subtler techniques, it works through clarity and urgency. Most effective when preceded by emotional buildup that makes the action feel like a natural next step.

Compliance literature (Cialdini & Goldstein, 2004); foot-in-the-door (Freedman & Fraser, 1966)

Human Detected
92%

Signals

The video features a human creator providing expert commentary and technical tutorials. The speech patterns are natural, the content is highly specific to the creator's ongoing series, and the presentation lacks the rhythmic monotony or generic phrasing typical of AI-generated content farms.

Natural Speech Patterns The transcript contains natural conversational markers like 'And the thing is', 'you know', and 'I've done a full deep dive into skills on this channel', which indicate a personal connection to the content.
Contextual Continuity The speaker references specific previous episodes and internal channel history ('Episode 5', 'Episode 4') in a way that suggests a consistent human creator building a series.
Technical Nuance The explanation of 'harness engineering' versus 'agent skills' involves specific, high-level synthesis of recent industry news (Stripe, Anthropic) delivered with a personal perspective on reliability.

Worth Noting

Positive elements

  • Provides a concrete, step-by-step demo of implementing an 8-phase contract review harness with sub-agents, file management, and cost-optimized models in Python/React.

Be Aware

Cautionary elements

  • The tutorial uses the creators' own app series as the example, seamlessly transitioning to promoting their paid community for complete access.

Influence Dimensions

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed: March 29, 2026 at 20:17 UTC · Model: x-ai/grok-4.1-fast · Prompt Pack: bouncer_influence_analyzer 2026-03-28a · App Version: 0.1.0
Transcript

A lot of people are banking on 2026 to be the year where AI gets real work done and delivers real business value. And I'm not talking about small stuff like writing blog posts or drafting social media content. I'm talking about an AI system that can successfully execute complex workflows that really impact the bottom line. Things like compliance audits, risk analysis, financial reports, impact assessments. These are complex multi-stage human processes that involve large amounts of data, and are in theory primed for AI. And the biggest challenge with these types of workflows is reliability. Andrej Karpathy describes this as the march of nines, where you can reach the first 90% of reliability with a strong build and a good demo, but each additional nine requires comparable engineering effort to achieve. And the thing is, agentic workflows compound failure. For a 10-step agentic workflow, like let's say a compliance audit, if you have a 90% success rate per step, then running that workflow 10 times a day will result in over six failures every day. If you can boost your success rate to 99% per step, then you're down to one failure every day. Achieve 99.9% at every step and you're down to one failure every 10 days, and so on. And while this example may be a bit extreme, because you might have a human in the loop, or non-agentic steps in a flow, the key thing here is that for businesses to fully adopt these AI systems, they need to be highly dependable and reliable, like traditional software. One possible solution to this is the concept of agent skills. These are portable, self-contained units of domain knowledge and procedural logic, along with optional supporting files that can help in achieving the task. I've done a full deep dive into skills on this channel, which I'll leave a link for above. But essentially, for a complex workflow like customer onboarding, you can document the specific operating procedure in the skill's markdown file: step one, two, three, and four, as you can see here.
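The compounding-failure arithmetic above is easy to verify. A minimal sketch, assuming step failures are independent and the video's 10 runs per day:

```python
# Expected daily failures for an n-step agentic workflow.
# Assumes independent step failures; numbers match the video's example.

def expected_daily_failures(step_success: float, steps: int = 10,
                            runs_per_day: int = 10) -> float:
    """Expected number of failed runs per day."""
    run_success = step_success ** steps  # whole run succeeds only if every step does
    return runs_per_day * (1 - run_success)

for p in (0.90, 0.99, 0.999):
    print(f"{p:.1%} per step -> {expected_daily_failures(p):.2f} failures/day")
```

At 90% per step this gives about 6.5 failures a day; at 99%, roughly one a day; at 99.9%, roughly one every 10 days, matching the figures quoted above.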
And the AI's task is to follow the steps to achieve the outcome. And that's exactly what Anthropic did last month when they released their concept of plugins, which are essentially bundles of skills that are domain-specific: legal, finance, HR. And even though these are essentially just markdown files, this sent a shock wave through the stock market, triggering a mass selloff in the stocks of SaaS companies. And while this concept of plugins and skills is a very powerful idea, they are by no means perfect. Agent skills are essentially just prompts. You're baking your process into a message to the AI, and you're hoping that it adheres to the instructions, hoping it doesn't hallucinate, quit early, skip steps, etc. Skills Bench carried out an evaluation of 84 popular skills in the market across all models. And while the addition of skills did definitely improve the pass rates of these tests, the overall success rates are well shy of what a business would need to reliably use them at scale without human intervention. There are ways that you can improve the performance of skills through evals, but you will never reach incredibly high levels of reliability through prompting alone. The solution is to harness the power of these AI systems by putting them on deterministic rails. And this is exactly what Stripe did with their concept of minions, where they built a scaffold around Claude Code to ensure all generated code changes, like bug fixes or new features, were automatically validated against a subset of the 3 million tests in their test suite. They didn't just prompt the AI to carry out tests; they guaranteed it by baking it into the process. And with this harness in place, they're able to merge 1,300 pull requests every week. So for complex, multi-stage, long-running workflows, the best approach is to create a specialized harness where you can gate and validate the output of each stage to ensure it stays on track.
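"Deterministic rails" in its simplest form means code, not a prompt, decides whether the workflow advances. A minimal sketch (the `Stage` and `run_pipeline` names are illustrative, not from the video's codebase):

```python
# Each stage's output passes through a programmatic gate before the next stage
# runs — the harness, not the LLM, enforces the process.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]       # in a real harness, typically an LLM call
    validate: Callable[[dict], bool]  # a code check, not a hope

def run_pipeline(stages: list[Stage], state: dict) -> dict:
    for stage in stages:
        state = stage.run(state)
        if not stage.validate(state):
            raise RuntimeError(f"Stage '{stage.name}' failed validation")
    return state

# Toy usage: extraction must produce non-empty text before classification runs.
stages = [
    Stage("extract", lambda s: {**s, "text": s["doc"].strip()},
          lambda s: bool(s["text"])),
    Stage("classify", lambda s: {**s, "kind": "SaaS agreement"},
          lambda s: "kind" in s),
]
result = run_pipeline(stages, {"doc": "  Sample contract  "})
```

The point is the contrast with a skill: a skill asks the model to follow steps, while the gate raises the moment a step's output is unacceptable.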
And this is just one aspect of harness engineering, which is an evolving discipline. Because harnesses are essentially just the software layer that wraps around an AI model, there are lots of different harness designs and architectures that you can create. General-purpose harnesses like Claude Code and Manus are incredibly powerful, whereas for these multi-stage, complex workflows, specialized harnesses are the way to go. But there are lots of others: autonomous harnesses like OpenClaw, hierarchical and multi-agent harnesses where you have swarms of agents that are coordinated, or DAG harnesses where your workflow is plotted on a graph and you can have the likes of branching, conditional splitting, and parallel execution. To demonstrate the concepts of harness engineering, I built a specialized harness into our Python and React app that I'm building out on this channel as part of our AI Builder series. I took inspiration from Anthropic's legal plugin and their contract review skill. I took the steps in their skills file and codified the process into a more comprehensive and reliable system. And this is it in action. I've dropped in the logo of a law firm here, because this type of complex workflow, a contract review workflow, is only worth building into a specialized harness if you need to operate it at scale. So depending on the size of the law firm, they may have a few of these every week. So firstly, we want to specifically select contract review as a mode, which is different to skills, where it's up to the AI to decide to pull it in or not. So we definitely want to trigger the harness here. So contract review, we then upload our file. Let's go with our sample SaaS agreement, and we'll just say "please review", and you can see our file has been uploaded to the workspace. We click go, and there are lots of concepts of harness engineering that you're going to see in action.
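The DAG pattern mentioned above can be sketched in a few lines: each node declares its dependencies, and any node whose dependencies are satisfied may run, which is what enables branching and parallel execution. This is an illustrative toy, not the video's implementation:

```python
# Minimal DAG harness sketch: nodes run once their dependencies have produced
# results, so independent branches are free to run in any order (or in parallel).

from typing import Any, Callable

def run_dag(nodes: dict[str, tuple[list[str], Callable[[dict], Any]]]) -> dict:
    """nodes maps name -> (dependency names, fn(results_so_far) -> value)."""
    results: dict[str, Any] = {}
    remaining = dict(nodes)
    while remaining:
        ready = [n for n, (deps, _) in remaining.items()
                 if all(d in results for d in deps)]
        if not ready:
            raise RuntimeError("cycle or missing dependency in DAG")
        for name in ready:
            deps, fn = remaining.pop(name)
            results[name] = fn(results)
    return results

results = run_dag({
    "extract": ([], lambda r: "contract text"),
    "clauses": (["extract"], lambda r: r["extract"].split()),
    "risks":   (["clauses"], lambda r: len(r["clauses"])),
})
```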
So the idea of a virtual file system is the first thing, and now you can see a plan that's appeared, and these are to-dos that are being checked off. This is the harness in action. So this is all codified in Python, created by Claude Code. This process that has now been executed is essentially the standard operating procedure, let's say, of the law firm. It's extracted the text from the document as part of phase one. There's verification that it has got what it needs, and then it moves to phase two, which classifies the contract. This is an LLM call, again with a structured, validated schema that needs to be populated. And then we're into phase three, which is asking the user clarifying questions before it carries out the analysis. So this is an example of human in the loop. So which side are we on? Let's say we're representing the customer, the deadline is tomorrow, and let's leave it at that. So then phase four, it loads up the playbook. So we have our playbook within our docs section here. So this is essentially RAG: you have your standard operating procedures, your precedents, your policies, etc. So it has completed that research, and then it moves on to clause extraction. And for very large contracts, you can use chunking here, so it's not a single shot. And this is the beauty of specialized harnesses: because it's Python, you have total flexibility on how you want to actually do this. So it's successfully extracted 34 clauses. And then the really interesting part kicks off, which is the risk analysis. So as part of our process, for every single clause, we want to spin up a dedicated LLM to carry out research and risk analysis. So you can see different tool calls for every single clause. It's loading up both the playbook as well as any other research that might be available, let's say, within the knowledge base for this law firm. And this is the concept of sub-agents within a harness. So all of these sub-agents have isolated context.
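The "structured, validated schema" idea from phase two can be sketched as follows. This is an assumption about the general shape, not the video's actual code: the field names and the fake `call_llm` are illustrative, and the harness rejects and retries any output that doesn't match the schema.

```python
# Sketch of schema-validating an LLM's JSON output before the harness accepts it.
# REQUIRED's fields and the retry count are illustrative assumptions.

import json

REQUIRED = {"contract_type": str, "governing_law": str, "confidence": float}

def parse_classification(raw: str) -> dict:
    data = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"field '{key}' missing or not {typ.__name__}")
    return data

def classify_with_retry(call_llm, max_attempts: int = 3) -> dict:
    last_err = None
    for _ in range(max_attempts):
        try:
            return parse_classification(call_llm())
        except (ValueError, json.JSONDecodeError) as err:
            last_err = err  # a fuller harness would feed the error back to the model
    raise RuntimeError(f"classification failed: {last_err}")
```

Because the gate is code, a malformed or incomplete answer can never leak into phase three.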
So it's not polluting the context of the main agent, which is acting as the orchestrator. So then we're into phase seven, red-line generation, and again, sub-agents kicking off to actually carry out the tasks. And this gives the scale that's needed for very large contracts and long-running tasks. So we have generated our 22 red lines, and now it's creating an executive summary. So you can see the plan is constantly being updated. And then you can also see the files. This is the scratch pad of this agent within this workspace. Every phase generates a file, so that way you have resilience. So if there is an issue at any point, you can restart the process halfway through and load up the progress from a previous phase. And there's our executive summary, at which point the harness is now complete. And this is the Word document that it generated, so we can download it. And in terms of reliability, this document is a template. This is fully programmatically generated in the harness. If you were leaving it up to the LLM to generate a Word document every time, you would get different formats; sometimes it might fail completely. Whereas having it fully scripted in the harness means it will execute against your template every time. It contains the executive summary and the various yellow lines and red lines, with original text and proposed text along with the rationale. Again, all baked into the logic of the harness. And if we look at the bottom here, you can see that there are only 7,000 tokens used by this main agent, whereas if we go into Langfuse and jump into that specific thread, you can see that overall this thread took 323,000 tokens. So that is just a huge number of sub-agents that have been triggered to carry out a really detailed analysis of this contract. And another interesting aspect of a harness is that you can use different models for different tasks. So our main agent that we're conversing with in this thread is Gemini 2.5 Pro. I'm using OpenRouter here.
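The "every phase generates a file" resilience pattern can be sketched like this. The layout and file naming are illustrative assumptions; the idea is simply that a rerun skips any phase whose checkpoint already exists:

```python
# Per-phase checkpointing sketch: each phase persists its output to the
# workspace before the run advances, so a failed run resumes where it stopped.

import json
import tempfile
from pathlib import Path

def run_phases(phases, workspace: Path) -> dict:
    state = {}
    for i, (name, fn) in enumerate(phases, start=1):
        checkpoint = workspace / f"phase_{i}_{name}.json"
        if checkpoint.exists():                     # resume: reuse finished work
            state.update(json.loads(checkpoint.read_text()))
            continue
        output = fn(state)
        checkpoint.write_text(json.dumps(output))   # persist before moving on
        state.update(output)
    return state

workspace = Path(tempfile.mkdtemp())
phases = [
    ("extract", lambda s: {"text": "sample contract text"}),
    ("classify", lambda s: {"kind": "SaaS agreement"}),
]
first = run_phases(phases, workspace)
second = run_phases(phases, workspace)  # second run reads only checkpoints
```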
Whereas if you jump into any of the sub-agents, these deep agent tasks, you can see we're using Gemini 2.5 Flash. So it's obviously a lot cheaper, because these are much more specialized tasks that they're carrying out at scale. So we need to keep the costs under control, but we're still getting the accuracy that we need from the smaller model, because it's a very narrow task that we're asking them to complete. And from there, you can then converse with the agent about the actual report. It has full access to the file system, so it can make changes to files, carry out more research against the knowledge base, whatever you want to do. So everything you just saw there, the harness, the sub-agents, skills, documents, this is all built out in our AI Builder series on our YouTube channel. This is the sixth episode in the series. And if you'd like to build along, the PRDs for this module are available in our public GitHub repo, while our full AI Builder course and codebase are available in our community, The AI Automators. This is a private community of hundreds of serious builders, all creating specialized harnesses and advanced RAG systems. We'd love to see you in there. So if you'd like to join, check out the link in the description below. So, as you can see, there are lots of benefits to building agent harnesses to solve real-world problems. It helps you keep long-running tasks on track. It handles tasks that are just too complicated for an agent to complete within a single context window. It solves the problem of context rot, because you're able to protect the context window of the main agent that you're conversing with, so you're not maxing it out and getting garbled, incoherent answers. With the harness, you can build in observability and transparency. And similar to what you saw with the generation of the Word document earlier, because you can actually do a lot of things programmatically within a harness, you can improve its reliability.
And through using cheaper models within sub-agents, you can keep costs under control while still burning lots of tokens. If you are looking to build an agent harness for your AI system, here are 12 things you absolutely need to know when it comes to designing it. The first is harness architecture. I showed this slide earlier about the different design patterns that you can use. And from a helicopter view, it is worth researching and investigating these types of design patterns so you can get your project off to the right start. Within my project here, I have a single-threaded supervisor, essentially, as part of a specialized harness. A key aspect of harnesses is the idea of planning. All of the popular harnesses, like Claude Code and Manus, have a version of planning to be able to keep their long-running agents on track. And the research shows that the longer an AI runs and the more tool calls it uses, if it's not able to ground itself on the general outline of a plan, in a lot of cases it can end up totally off track from the original request. And it is useful to think of these plans as either fixed or dynamic. So within this contract review system, we have eight phases, and it's the exact same steps each time. Whereas you can also have a dynamic plan. So here I'm just going to use deep mode, which is a system we've created, and I'm going to ask it to plan my birthday party. And the LLM is going to generate its own plan depending on the request. So you can see it's written its own to-dos: propose a theme, find a venue, plan activities, suggest food. And what's interesting about this type of dynamic plan is that as it works through step by step, it has the ability to change the ordering of the items. It can tick things off, it can remove items, add new items in. So it is totally dynamic. And that's why this type of dynamic plan is not suitable for my contract review harness, because I don't want the LLM making it up as it goes along.
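The fixed-versus-dynamic distinction can be made concrete. A minimal sketch, assuming illustrative names: a fixed plan is an immutable sequence of phases, while a dynamic plan is a mutable to-do list the LLM is allowed to reorder, complete, or extend:

```python
# Fixed plan: an immutable phase list, same every run (the contract review case).
FIXED_PLAN = ("extract", "classify", "clarify", "load_playbook",
              "extract_clauses", "risk_analysis", "redlines", "summary")

# Dynamic plan: a mutable to-do list the model may edit as it works
# (the birthday party case). Class and method names are illustrative.
class DynamicPlan:
    def __init__(self, todos: list[str]):
        self.todos = list(todos)
        self.done: list[str] = []

    def complete(self, todo: str) -> None:
        self.todos.remove(todo)
        self.done.append(todo)

    def add(self, todo: str, position: int = -1) -> None:
        """Insert a new to-do; position -1 appends at the end."""
        self.todos.insert(position if position >= 0 else len(self.todos), todo)
```

Using a tuple for the fixed plan is deliberate: the harness physically cannot reorder or drop phases, which is exactly the property you want when the LLM must not improvise.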
I want to actually rein it in. I want it on deterministic rails. All harnesses make use of a file system in one shape or another. Claude Code, for example, is a CLI application that has full access to the directory of your codebase, whereas for the likes of Manus, or my own system here, I have built a virtual file system. As you can see on the bottom right, this is essentially a scratch pad that the agent is able to write files to, read files from, make updates to, etc. And the scope of this file system is essentially a workspace that's tied to this chat thread. As you saw in our demo, the idea of delegating tasks is a key part of a harness, because if you're not delegating tasks, it's essentially just a single LLM call that has tool-calling capabilities. By delegating tasks, you're able to achieve context isolation. So each of those sub-agents has a completely fresh context window, and you have total control over what you inject into it. And as you saw earlier, I was able to use cheaper, faster models for the sub-agents while keeping the more sophisticated, more expensive model for the orchestrator or the supervisor that I was conversing with. And the beauty of delegating to sub-agents is you can have parallel processing. So this just triggered five sub-agents, and it's just triggered another five sub-agents, in parallel, in batches in fact. Manus does this very well with their wide research functionality, where it can research 500 different products or web pages in parallel, and in the space of a few minutes generate a really comprehensive report. So parallel processing of sub-agents, where there aren't actually dependencies between them, is very effective. Tool calling, and then guardrails around which tools can be called, is a key part of a harness as well.
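The batched fan-out of sub-agents described above can be sketched with `asyncio`. The `run_subagent` stub stands in for a real model call, and the batch size of five mirrors the demo; everything else is an illustrative assumption:

```python
# Parallel sub-agent sketch: clauses are analysed in batches of five, each
# sub-agent seeing only its own clause (a fresh, isolated context).

import asyncio

async def run_subagent(clause: str) -> dict:
    # Stand-in for a real LLM call; only this clause enters the sub-agent's context.
    await asyncio.sleep(0)
    return {"clause": clause, "risk": "low"}

async def analyse_clauses(clauses: list[str], batch_size: int = 5) -> list[dict]:
    results: list[dict] = []
    for i in range(0, len(clauses), batch_size):
        batch = clauses[i:i + batch_size]
        results += await asyncio.gather(*(run_subagent(c) for c in batch))
    return results

results = asyncio.run(analyse_clauses([f"clause {n}" for n in range(12)]))
```

Batching rather than launching everything at once keeps you inside provider rate limits while still collapsing a per-clause loop into a few concurrent waves.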
You saw here, with our load playbook phase, that we carried out a number of different tool calls to traverse the knowledge base: ls, grep, glob, and read. But you could also have human-in-the-loop style approval, whereby if you were pushing, let's say, this contract review to a legal software system, you could have it so that it requires a manual approval in this interface. So those types of access controls and guardrails you can build into your custom harness. Memory is a key aspect of harnesses, particularly the likes of autonomous harnesses like OpenClaw. And there are two key aspects to memory: short-term and long-term. Short-term memory is generally saved as markdown files and then programmatically read into system prompts to continue on the process. Long-term memory can also be saved to markdown files, but obviously you need it to persist outside of a single workspace. But it doesn't need to be a markdown file either. You could use a knowledge graph, for example, the likes of a temporal graph like Graphiti. And if you look at OpenClaw, for example, every time it's event-triggered, it's able to read from its memory to figure out what to do next. Specialized harnesses are essentially state machines. Here we have a sequential eight-phase process, and as you can see with the plan on the top right, it is keeping track of its state as it progresses through. You can obviously get a lot more sophisticated with this type of state-based workflow. And the key aspect then becomes how you actually track state. Within our system here, which is built on Supabase, we have a harness runs table which keeps track of the status of each harness run and the current phase that it is in. So this is essentially state management, where the actual state machine is codified in the contract review Python file within the harness engine itself. So even if you're not a developer, Claude Code will be able to build out quite sophisticated harnesses like this.
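The state-machine view can be sketched with an enum of phases plus a run record, roughly what a "harness runs" row would carry (the phase names follow the demo; the in-memory dict standing in for the database row is an illustrative assumption):

```python
# Harness-as-state-machine sketch: a run record tracks its current phase,
# and advancing is an explicit, codified transition.

from enum import Enum

class Phase(Enum):
    EXTRACT = 1
    CLASSIFY = 2
    CLARIFY = 3
    PLAYBOOK = 4
    CLAUSES = 5
    RISK = 6
    REDLINES = 7
    SUMMARY = 8

def advance(run: dict) -> dict:
    """Move a run to the next phase; mark it complete after the final phase."""
    current = Phase(run["phase"])
    if current is Phase.SUMMARY:
        return {**run, "status": "complete"}
    return {**run, "phase": current.value + 1, "status": "running"}
```

In a real system `advance` would also persist the new phase (e.g. an UPDATE on the runs table), which is what makes a crashed run resumable from its recorded phase.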
You'll find code execution is pretty central to most harnesses as well. Modern-day agents typically interact with file systems via CLI using sandboxes. And this is something I went into in a lot of detail in our last video, around programmatic tool calling within a sandbox. LLMs are brilliant at generating code, so by passing it to a secure sandbox like this, the agent is able to read and write files in the workspace, and then you can actually action things. For this, we're using LLM Sandbox, which is a great GitHub repo, and it spins up these isolated sandboxes as and when they're needed. Context management is obviously central to harness engineering from lots of different perspectives. Number one, you obviously want to avoid context rot. So you want to keep the context of your main supervisor agent, the agent you're conversing with, as lean as possible. That being said, though, it will eventually max out if you keep conversing with that agent, so you need a mechanism for compacting and summarizing context, very similar to what you see in Claude and Claude Code. And it's not just context management; you need old-school prompt engineering as well, particularly if you have dedicated sub-agents that you are delegating to. There are a lot of good tricks that you can use with context management. If you have tool calls that output thousands of tokens, for example, instead of reading the output directly into the context, you can save it to a file and then only provide a summary of that file to the agent, and the agent then has file navigation tools to ls, grep, glob, and read that file. So this is very useful, particularly for the likes of a web search tool. You saw human in the loop in action earlier, where even in a sequential eight-phase flow like I have here, there can be touch points with the user to guide it in a certain direction. And as I mentioned, for the likes of tool calls, you can require human approval if needed.
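The spill-to-file trick for oversized tool output can be sketched in a few lines. The function name, summary length, and workspace layout are illustrative assumptions:

```python
# Context-management sketch: a large tool output is written to the workspace,
# and only a short summary plus the file path reaches the agent's context.

import tempfile
from pathlib import Path

def tool_call_to_file(output: str, workspace: Path, name: str,
                      summary_chars: int = 200) -> dict:
    path = workspace / f"{name}.txt"
    path.write_text(output)
    return {
        "file": str(path),                  # agent can ls/grep/read it on demand
        "summary": output[:summary_chars],  # only this enters the context window
        "total_chars": len(output),
    }

workspace = Path(tempfile.mkdtemp())
record = tool_call_to_file("x" * 5000, workspace, "web_search")
```

The agent's context carries a couple of hundred characters instead of thousands of tokens, and the full output stays one `read` away if it's actually needed.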
Validation loops are a critical part of a harness, and it's one area that's lacking in my system. Claude Code does this brilliantly, because it can generate a piece of code, test it programmatically itself, and if it fails, go back and iterate on the code. So if it loops through that a few times, you will end up with code that actually works. Now, while that works very well for code-based applications, it's a bit different for a contract review, but it is still possible. You could run validation loops on the likes of fact-checking, or have a loop that runs through every clause and compares it against the playbook. So if the proposed changes don't line up, you actually get it to modify itself. So this is really where you can improve the quality of the output of the harness. And finally, agent skills are still incredibly useful, even within the likes of a harness. So essentially, if you need something to happen every single time, you should codify it. Whereas if you're looking to expand out the capabilities and then guide it as a co-pilot, it's well worth using agent skills. And on that topic, if you would like to learn more about agent skills and how you can build them into your own custom AI system, then check out this video here. Thanks so much for watching, and I'll see you in the next one.
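The generate-check-retry loop described above reduces to a small control structure. Here `generate` and `check` are stand-ins for an LLM call and a programmatic validator (tests, playbook comparison, fact checks); the failure feedback is handed back to the next generation attempt:

```python
# Validation-loop sketch: generate a candidate, check it in code, feed the
# failure reason back, and retry until it passes or the budget runs out.

from typing import Callable, Optional

def validate_loop(generate: Callable[[Optional[str]], str],
                  check: Callable[[str], tuple[bool, Optional[str]]],
                  max_iterations: int = 3) -> str:
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        candidate = generate(feedback)      # feedback from the previous failure
        ok, feedback = check(candidate)     # programmatic, not prompted
        if ok:
            return candidate
    raise RuntimeError(f"still failing after {max_iterations} iterations: {feedback}")

# Toy usage: the second attempt, armed with feedback, passes the check.
attempts = []

def generate(feedback):
    attempts.append(feedback)
    return "revised draft" if feedback else "first draft"

def check(candidate):
    if candidate == "revised draft":
        return True, None
    return False, "clause 4 conflicts with the playbook"

result = validate_loop(generate, check)
```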

Video description

👉 Get ALL of our systems & join hundreds of serious AI builders in our community https://www.theaiautomators.com/?utm_source=youtube&utm_medium=video&utm_campaign=tutorial&utm_content=cc-harness 🔗 PRDs: GitHub Repo: https://github.com/theaiautomators/claude-code-agentic-rag-series/tree/main/ep6-agent-harness A lot of people are banking on 2026 as the year AI delivers real business value, not blog posts and social media drafts, but complex multi-stage workflows like compliance audits, contract reviews, and risk analysis. The biggest challenge? Reliability. As Andrej Karpathy describes with the March of Nines, a 10-step agentic workflow at 90% per step will fail over 6 times a day. Agent skills help, but prompting alone won't get you to production-grade reliability. The solution is harness engineering: putting AI systems on deterministic rails with validation, state management, and programmatic control. In this video, we build a specialized contract review harness into our Python and React app, inspired by Anthropic's legal plugin. It runs an 8-phase process with sub-agents, parallel processing, file system management, and template-based output, all orchestrated in code rather than left to the LLM to figure out. 
🔗 Links: GitHub Repo (PRDs): https://github.com/theaiautomators/claude-code-agentic-rag-series Skills Bench Evaluation: https://www.skillsbench.ai/ Stripe Minions Post: https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents-part-2 Episode 5 (Tool Calling & Sandboxes): https://www.youtube.com/watch?v=R7OCrqyGMeY Episode 4 (Agent Skills): https://www.youtube.com/watch?v=4Tp6nPZa5is Full codebase available to AI Automators community members 📌 What's covered: - The reliability problem: why agentic workflows compound failure and what the March of Nines means in practice - Agent skills: what they are, why they help, and where they fall short (Skills Bench evaluation results) - Harness engineering: putting AI on deterministic rails instead of hoping it follows instructions - Stripe's minions: how they scaffold Claude Code with 3M tests to merge 1,300 PRs per week - Live demo: 8-phase contract review harness with clause extraction, risk analysis, and red-line generation - Sub-agents with isolated context windows running cheaper models (Gemini 2.5 Flash) at scale - 12 key harness design principles: architecture, planning, file systems, delegation, tool calling, memory, state machines, code execution, context management, human-in-the-loop, validation loops, and skills 🔍 Tech stack: - Python backend / React frontend - Supabase (state management and harness runs table) - LLM Sandbox (Docker-based code execution) - Langfuse (observability and token tracking) - Virtual file system with workspace-scoped scratch pads - Template-based Word document generation for reliable output Key takeaway: For complex, multi-stage workflows that need to run at scale, agent skills alone aren't enough. Specialized harnesses let you codify your process into deterministic phases with validation, sub-agent delegation, and programmatic output, turning unreliable AI demos into production systems. 
📌 This is Episode 6 of our AI Builder series where we're building a full AI agent web app from scratch using Claude Code. ⏱️ Timestamps: 00:00 Skills vs Harnesses 04:45 Specialized Harness Demo 11:50 12 Things You Need to Know #AI #HarnessEngineering #AIAgents #AgentSkills #ContractReview #ClaudeCode #AgenticRAG #LLM #Supabase #Gemini #Anthropic #AIBuilder

© 2026 GrayBeam Technology Privacy v0.1.0 · ac93850 · 2026-04-03 22:43 UTC