bouncer

AI News & Strategy Daily | Nate B Jones · 55.3K views · 1.7K likes

Analysis Summary

45% Low Influence
mild · moderate · severe

“Be aware of the 'revelation' framing that suggests a fundamental truth about AI has been 'missed' by everyone else; this creates an artificial sense of urgency to adopt the host's specific strategic frameworks.”

Transparency: Mostly Transparent
Primary technique

Strategic ambiguity

Leaving claims vague enough that different audiences each hear what they want. By never committing to a specific, falsifiable position, the speaker avoids accountability while supporters project their own preferred meaning.

Eisenberg (1984); dog whistling research (Mendelberg, 2001)

Human Detected
95%

Signals

The transcript exhibits a distinct personal voice, rhetorical flair, and subjective emotional framing ('keeping me up at night') that is characteristic of human thought leadership rather than synthetic generation. The presence of a personal website and newsletter further validates the content as a human-led production.

Personal Voice and Anecdotes: The speaker uses first-person phrases like 'It's keeping me up at night,' 'I want to suggest,' and 'I don't know about you, but the last time I solved an international Olympiad math problem...'
Natural Speech Disfluencies: The transcript includes conversational fillers and rhetorical self-corrections like 'Oh, never. It just doesn't happen at work' and 'Let me dive into what I mean and then you decide whether I'm right or whether I'm out to lunch.'
Specific Domain Expertise: The content references specific, niche industry developments (Cursor's math breakthrough, planner-worker-judge architecture) with a nuanced perspective rather than generic summaries.

Worth Noting

Positive elements

  • This video provides a clear explanation of 'agentic workflows' (planner-worker-judge) and how they differ from simple single-turn chatbot interactions.

Be Aware

Cautionary elements

  • The use of 'revelation framing' to make standard industry evolutions feel like a secret discovery that requires the host's specific guidance to navigate.

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed: March 13, 2026 at 16:08 UTC · Model: google/gemini-3-flash-preview-20251217 · Prompt Pack: bouncer_influence_analyzer 2026-03-11a · App Version: 0.1.0
Transcript

What if AI isn't jagged anymore? That is what I cannot stop thinking about. It's keeping me up at night. Everyone has assumed that AI capabilities have a jagged pattern: that they're incredible at some things and they're terrible at others. Experts talk about this. We see this in our daily lives. It seems like a truism. Why would I challenge it? The jaggedness has become an organizing frame for just about everything. We think about how we install AI that way. We think about how we teach AI that way. But there is something we got wrong about jaggedness. The jagged frontier was never an inherent property of AI intelligence. I want to suggest it was an artifact of how we were asking the AI to work, and that we are starting to figure that out as AI gets better. Let me dive into what I mean and then you decide whether I'm right or whether I'm out to lunch. Here's what happens when you ask a model a typical question. We're going to start with the basics. You see that jaggedness just the way we've all seen it in chats. I've seen it too. When you ask a model for one answer in one turn (here's a question, I want an answer), all of the variance in task difficulty shows up as jaggedness in outcomes. That's not necessarily because the intelligence is jagged. It's because no organizational structure was being applied to that work. We have been asking a capable analyst to solve every problem in 30 seconds with no notes, no colleagues, no ability to try something, and no ability to retry. Now, that mental model isn't fully correct anymore. That was initially true in 2022, but I want to walk you forward. Now, we're looking at inference-time compute, where AI takes time to decide and has some tokens it can process. This is what you get with ChatGPT 5.2 Thinking and 5.2 Pro. It can think, it can maybe try some tokens that don't work, it can correct its mistakes, and it can come back. This produces higher quality results. We see better performance. But most of our conversation has been about the fact that we see better performance, and maybe we haven't noticed that the jaggedness has started to smooth out. You don't have issues with counting the R's in "strawberry" anymore, do you? The mental model that shaped three years of AI strategy needs to change. It needs to change because the last 30 days have convinced me that jaggedness is no longer the right guiding paradigm for how AI works in the workplace. It is certainly true that there are extraordinary capabilities for AI and there are capabilities for AI that are just very good. Now, that is a kind of jaggedness. That is not a super relevant jaggedness, because I don't know about you, but the last time I solved an international Olympiad math problem at work was, "Oh, never. It just doesn't happen at work." So, I'm interested in the practical. In that world, the world of PRDs, the world of code, the world of customer service tickets, AI is not jagged anymore. And we need to stop pretending that it is. And I know why. And knowing why is going to help you understand how to do your work differently. So let's go back and consider what single-turn, single-agent interaction actually means. You present a problem. The model produces a response. If it contains an error midway through, the error propagates through every single thing that follows. If the first approach is wrong, there's no mechanism to detect that and to try something else. If the task requires more information than fits in a context window, it cannot accumulate that information incrementally very well.
Every problem needs to be solved in one shot. This is the most primitive version of a chatbot. It's close to what we experienced with ChatGPT when it initially launched. And this is not how any competent human professional works. We know that, right? It's not how a lawyer researches a case. It's not how an engineer designs a system. It's not how a scientist runs an experiment. All of these involve trying things, recognizing when they're not working, adjusting, accumulating information over time, getting feedback at intermediate stages, and revising. And yet all of these organizational structures that we've built around professional work, the review processes, the sprint cycles, the peer feedback loops, the draft-revise-publish pipeline, they all exist because we have a hard time solving one-shot cognition problems too. And we seem to have forgotten that AI might be able to use that help too, guys. So we deployed AI originally into a paradigm that removes so many of those structures that help us to think, and then we described the resulting limitations as a property of AI. Now I am very aware I'm simplifying the story. So I'm going to walk you forward in a way that you understand what's going on here. That was 2022. We learned a lot. We got inference-time compute, which I described earlier in this video; it helps AI not make mistakes. We got some tools for AI, which help AI a lot as well. We also realized that we need to be better at describing our tasks, which helps AI too, and that's what we call prompting. So we have been working on our side to provide tools, and at the same time, I'm very aware AI has been getting smarter because we've been scaling intelligence. We've been scaling it partly through inference-time compute, which I described, and also partly through reinforcement learning, the old tried-and-true method that we've been using since the beginning of LLMs. And so what we see is a trend line where intelligence has been climbing, but our fluency at using the tool has been getting better, too. And we haven't been tracking that curve, guys. We've been talking about the intelligence curve. We have not been talking about the curve that allows us to actually use this tool: the ability to learn to put agents into harnesses, the ability to learn to use tools in a loop to do practical work. And what we really haven't recognized is that we are on a learning trend line that matters more than the intelligence curve at this point, at least for practical work. Because figuring out the scale at which we can operate intelligence now is a function of our ability to use tools with agents, our ability to use harnesses with agents. A harness is the state around the agent, the scaffolding around the agent, the thing the agent operates within that allows it to do work. Maybe it's a markdown file for tasks. Maybe it's a spot to put its memory. All of it comes together. It's a harness. It allows it to do meaningful work. We've forgotten how valuable that part is. We've forgotten that if we do that well, maybe we will address the jaggedness. And so when the first couple months of the year arrived, here we are, surprised when the jaggedness starts to disappear. All at once. All at once, guys. Video starts to get better. Tech starts to get better, mathematics gets better, science gets better. I'm talking about specific advances that have been happening in the last 60 days. We are not seeing jagged improvements anymore. We are seeing a pattern of improvements where everything is getting better at once.
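To make the harness idea concrete, here is a minimal Python sketch of that kind of scaffolding: a task file, a spot for memory, and a loop that lets an agent try, fail, and retry instead of answering in one shot. Every name here is an illustrative assumption, not any vendor's actual API.

    # Minimal illustration of a "harness": persistent state around an agent
    # (task list, memory, retry loop). All names here are hypothetical.
    from dataclasses import dataclass
    from pathlib import Path

    @dataclass
    class Harness:
        tasks_file: Path        # e.g. a markdown file listing open tasks
        memory_file: Path       # a spot for the agent to persist notes
        max_attempts: int = 3   # the agent may try, fail, and retry

        def read_tasks(self) -> list[str]:
            if not self.tasks_file.exists():
                return []
            return [t for t in self.tasks_file.read_text().splitlines() if t.strip()]

        def remember(self, note: str) -> None:
            with self.memory_file.open("a") as f:
                f.write(note + "\n")

    def run_in_harness(harness: Harness, attempt_task, check_result) -> dict:
        """Work through tasks with retries, instead of one-shot answers."""
        results = {}
        for task in harness.read_tasks():
            for attempt in range(harness.max_attempts):
                result = attempt_task(task)      # call the model / agent here
                if check_result(task, result):   # verify before moving on
                    results[task] = result
                    harness.remember(f"DONE: {task}")
                    break
                harness.remember(f"RETRY {attempt + 1}: {task}")
        return results

The point of the sketch is only that the retry loop, the task list, and the memory live outside the model; the "intelligence" being harnessed is whatever sits behind attempt_task.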
The frontier of AI is smoothing, and we are seeing much more smoothing if we look at the smaller bubble that is work, because work is inside the frontier at this point. Like I said, the last time I solved a math problem like an international Olympiad athlete was never. We don't do it at work. Only a few of us are doing complicated science at work. Very few of us are actually doing super complicated engineering problems at work. Now, we may be putting that effort in one time, but then we are executing against that to build out the product. For us, for most of our work, this is a smooth surface. It is not jagged, and we have got to recognize how big a deal that is, because that changes all of our assumptions about where we should expect AI to work and where we should deploy. I hear you saying, Nate, what's the proof here? You've given vague generalities. Come on. The proof arrived on March 3, when Cursor CEO Michael Truell announced that Cursor had discovered a novel solution to problem six of the first proof. Yes, we're back to math, but not for long. This is a research-grade mathematics problem drawn from the unpublished work of Stanford, MIT, and Berkeley academics. In other words, you can't reinforcement learn on it. It's unpublished. It didn't just solve it. It improved on the official human-written solution: stronger bounds, better coverage. And Cursor did this using the exact same coding harness that six weeks earlier had built a web browser from scratch. The harness ran for 4 days on this math problem with zero hints, zero human nudges, and zero mid-course guidance. And then it solved it. So here's why I'm saying smoothing matters, people. Cursor was not built to solve math problems. It's one thing if Google's like, "We put this special math model together. It's super special and it did a special math." Or if OpenAI says the same thing: we put a special math model together and it did a special math. Great. Good for you. But in this case, it matters more because Cursor is not a mathematics-solving company. Cursor is a coding company. A system designed to write code looked at a problem in spectral graph theory (you tell me what that is, I have no idea) and produced mathematics that the problem's own authors hadn't found. This is a huge deal, and I think Michael Truell put it well. What he said was: this suggests that our technique for scaling agent coordination might generalize beyond coding. I will go farther. I will say it suggests that the way we put agents into harnesses to do long-running work looks like it will work for any domain that is even reasonably verifiable. In other words, any domain where we can reasonably determine a correct answer. That opens up a lot. That's not just math. That's not just code. That opens up legal. That opens up, in fact, many customer service use cases, because there's a verifiable correct answer. There are a surprising number of verifiable or near-verifiable problems in the business world where the answer is basically known. We know what's correct and what's not. This is a situation where the architecture of the agent marched up and just ate the problem set, and along the way changed the way we should think about AI and jaggedness and our daily work. And so you might be wondering what's in the box on this Cursor agent. Is it something special? Is it secret sauce they're not going to share? No, they're going to share. In fact, they did share. In January, Wilson Lin published a Cursor blog post on scaling long-running autonomous coding.
The first attempt was flat coordination, right? Agents shared a single file. They used locks to avoid collisions, and it failed very badly. Agents became risk-averse. They avoided difficult tasks and they optimized for small and safe changes. You got lots of activity, but you did not get much progress. The breakthrough came from hierarchy and specialization. Planners explore the codebase and create tasks, spawning sub-planners recursively. So there are two layers here. Workers pick up individual tasks and grind until done, and they ignore everything else. A judge, an LLM-as-judge, determines whether to continue, and the next iteration begins afresh. The judge's ability to restart cleanly, bringing in a new agent with fresh context, turned out to be one of the system's most important properties, because it got around the problem of the context window. And so, as I mentioned earlier, the test case was building a web browser from scratch in Rust. The agents ran for a week and wrote a million lines of code. Cursor ran the same harness on a Solid-to-React migration and got that to work. They ran it on a Java language server. These are all coding problems. They ran it on a Windows 7 emulator, 1.2 million lines, and an Excel clone, 1.6 million lines. I think the Cursor team is having fun, guys. Two lessons emerged. First, model choice matters a lot for long-horizon tasks. They found that GPT 5.2 consistently outperforms Claude Opus, which tends to stop earlier and take shortcuts. Second, and more counterintuitively, many of the improvements they made came from removing complexity in the agentic system rather than adding to it. So, the actual improvement came from stripping out a lot of the complicated coordination machinery, adding hierarchy, and letting agents work in very clean isolation. It is probably not an accident that that harness looks very similar to the Codex harness that you can set up if you download and use the Codex app, where you have agents running in isolation in sandboxes. And the deepest observation is this: the system's behavior is disproportionately determined by the design of the prompt. Yes, prompting is still going to matter in the future, guys. I've been saying it for a long time. If you can prompt with all of the information, the complete solution, what the model needs to do to be correct, and you set up your model harness correctly, it will run for a long time. And so Cursor got excited. They got experimental and they pointed it at this math problem. Cursor's system found an approach that I can barely pronounce, something involving the Marcus-Spielman-Srivastava interlacing polynomial method. Don't ask me about that, and don't try to say it five times fast. But the point is it solved it, and it went beyond what humans did. This should wake you up. If you are thinking that a coding agent does code, if you are thinking that an LLM is a narrow thing, this should wake you up. It is not a narrow thing. These LLMs, especially in agents, are generalizing broadly. And this goes back to what I was saying earlier in the video. We have assumed that jagged responses from LLMs are a function of intelligence. But the lesson that's in plain sight over the last few years is that it's actually been at least as much a function of the harness we put the agent in. Now, at this point, four organizations, Anthropic, Google DeepMind, OpenAI, and Cursor, have independently built very large multi-agent coordination systems designed to do long-horizon work. None have coordinated.
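As a rough sketch of the planner-worker-judge shape described above (not Cursor's actual code; the function names and stub logic are assumptions for illustration), the control loop might look something like this, with an LLM call standing behind each stub:

    # Toy planner-worker-judge loop. Hypothetical names; a real system would
    # call a model where the stub functions appear.
    from concurrent.futures import ThreadPoolExecutor

    def plan(goal: str, codebase_summary: str) -> list[str]:
        # Planner: inspect the current state of the work and break the goal
        # into independent tasks (a real planner could spawn sub-planners).
        return [f"{goal} :: subtask {i}" for i in range(1, 4)]  # stub

    def work(task: str) -> str:
        # Worker: grind on one task in isolation, ignoring everything else.
        return f"result for {task}"  # stub

    def judge(goal: str, results: list[str]) -> bool:
        # Judge: decide whether the goal is met or another iteration is needed.
        return len(results) >= 3  # stub criterion

    def run(goal: str, max_iterations: int = 5) -> list[str]:
        history: list[str] = []
        for _ in range(max_iterations):
            tasks = plan(goal, codebase_summary="...")   # fresh planning pass
            with ThreadPoolExecutor() as pool:           # workers run in parallel
                results = list(pool.map(work, tasks))
            history.extend(results)
            if judge(goal, history):                     # judge ends or restarts
                break
        return history

    print(run("build feature X"))

The design point the transcript emphasizes, restarting each iteration with fresh context while accumulated results persist outside the model, shows up here as the history list living in the loop rather than inside any single agent call.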
All four exhibit a similar structural pattern. And to my mind, this hasn't been clearly articulated. So hear me now. This is not as different as it sounds. There are some differences in their patterns that are related to the models they use, but the underlying architectures are similar. One, decompose the work. Two, parallelize the execution. Three, verify outputs. And four, iterate toward completion. Anthropic's approach is an initializer agent that sets up an environment state and a progress file. A coding agent then makes incremental progress and leaves structured artifacts that the next session can read. Without this structure, the failure modes are super vivid, right? The agent might try to one-shot the whole implementation. It might run out of context mid-build. It might leave things worse than they started or mark features complete without testing them. Google DeepMind's approach, especially with the Althea mathematics model, separates generation, verification, and revision into very distinct roles. It's the same principle underlying code review, underlying adversarial legal proceedings, and underlying scientific peer review. Do you see how this shows up in many fields? OpenAI's Codex runs tasks in parallel sandbox environments, like I was saying. And Cursor's planner-worker-judge approach is structurally similar to how software teams with engineers and a PM actually operate. Yes, I hope it's not lost on you: this is also how people work. This convergence is not a coincidence. It is a solution to a real problem. How do you get useful work from units of intelligence with finite context, finite per-step reliability, and no persistent memory? I know that we chuckle, but there's some comparison to some of us too. We also are units of intelligence. We have finite context. We have finite reliability. We make mistakes, and some days we don't have as good a memory as others. The answer for both the human organization and the agent turns out to be organizational. You create roles, you create handoffs, you create verification, you create restart procedures. These are not AI-specific insights. They're management insights that generalize to autonomous agents as naturally as they generalize to human teams. In other words, we figured out how to generalize our intelligence by working collectively. And we seem to have forgotten those lessons and replicated them without realizing it. Because it turns out that when you take the exact same things we have been doing to organize human work and you stick them in with agents and you put them in a harness, that also is a good way to get agents doing meaningful work that they could not do individually. That is a big, big deal. In other words, we humans have figured out a form of organizational intelligence, and now we are giving it to agents, and it turns out it scales. We should pay more attention to this than we are. Look, there's an obvious critique at this point, and I can hear people coughing in the back. Multi-agent harnesses are extremely expensive. And I want to engage with this honestly because it's not entirely incorrect. At the most fundamental level, multi-agent systems generate a ton of tokens that would not be generated by a single-turn interaction. So, the cost is real. You have to be ready to enable token burn if you go with this kind of a system. But multi-agent systems give you an organizational strength you can't get any other way for the hardest problems. They give you structural diversity.
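For the initializer-plus-progress-file pattern attributed to Anthropic above, a minimal sketch might look like the following. The JSON layout and field names are assumptions made up for this example, not any vendor's real format.

    # Illustrative "progress file" handoff between bounded agent sessions.
    # The file layout is an assumption, not a real published format.
    import json
    from pathlib import Path

    PROGRESS = Path("progress.json")

    def load_progress() -> dict:
        if PROGRESS.exists():
            return json.loads(PROGRESS.read_text())
        # Initializer step: set up environment state before any coding session.
        return {
            "completed": [],
            "remaining": ["set up repo", "implement parser", "add tests"],
            "notes": [],
        }

    def save_progress(state: dict) -> None:
        PROGRESS.write_text(json.dumps(state, indent=2))

    def run_session(state: dict) -> dict:
        """One bounded session: make incremental progress, then stop cleanly."""
        if state["remaining"]:
            task = state["remaining"].pop(0)
            # ... do the work, run the tests ...
            state["completed"].append(task)
            state["notes"].append(f"finished: {task}")  # artifact for the next session
        return state

    state = load_progress()
    state = run_session(state)
    save_progress(state)

Because each session reads and writes the same structured artifact, a fresh agent can pick up where the last one stopped without needing the previous context window.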
Parallel workers can explore different decompositions of the problem. Dead-end results can inform the next planning cycle without contaminating another worker's context. The planner can spawn sub-planners to go deep on specific problems. Partial progress can accumulate across context windows rather than resetting. This is about organization design. A brilliant individual with unlimited time can in principle solve almost anything. But certain problem classes are structurally inaccessible to serial cognition, not because the individual lacks the capability, but because the problem requires too many exploratory paths to hold in working memory simultaneously. So we don't structure organizations as one very smart person trying to do everything. We structure them with roles and with handoffs and verification, because otherwise we don't get consistent progress, regardless of individual talent. Now, I am here to tell you, I've told you before, I think teams of one have a special role in the world of AI, because teams of one are really teams of more than one. If you are a team of one and you can manage this kind of multi-agent system, you can be a team of a hundred and you're just you. But I want you to keep in mind the fact that meaningful progress in human terms has almost always involved a team. Even those moments that we celebrate in science, like Einstein's breakthrough, were both a function of individual genius and also a function of the scientific community around that individual. As Isaac Newton said, we stand on the shoulders of giants. But let's say you believe me. Let's say you're like, okay, I get it, Nate. We are looking at a world where things are more smooth than they appear. I haven't been paying attention to harnesses. I promise to. I haven't been paying attention to tooling. What can I look for at work that matters here? I want to suggest to you that you need to start thinking about two tiers of domain verifiability. The first is the simplest. It's the one that's easiest to get a hold of. It's the one we tend to think of with AI these days. It's machine-checkable. The code compiles or it doesn't; the tests pass or fail. The second tier is expert-checkable with clear criteria. So mathematical proofs, engineering designs, legal briefs. I think there are actually many categories of work inside the knowledge-based economy that run like this, because in almost every field, experts will look at something and say this is correct or incorrect and come to a near consensus. I would say that's true even for things that we traditionally would not consider to be a tier-two problem. Let me give you an example. If you were constructing a product strategy for a particular company and you bring that product strategy to three or four different product leaders, each with 15 or 20 years of experience, I am willing to bet you a lunch that their assessment of that product strategy will be remarkably consistent. There is a set of patterns that they have internalized that they are able to apply in that particular situation. In other words, that kind of work, which we have traditionally called soft work that is very hard to verify, is more verifiable than we think. And if that's more verifiable than we think, we really should be thinking about a lot more of our work as something AI can access and be helpful with. So what does this mean for work? The Anthropic 2026 agentic coding trends report describes engineers delegating tasks where they can easily sniff-check for correctness.
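A small sketch of what the two tiers could look like in practice, with all function names and the passing threshold assumed for illustration: tier one is a machine check (the tests pass or they don't), tier two scores an artifact against an explicit rubric, whether the scorer is a human expert or an LLM judge.

    # Two illustrative verifier tiers. Names, rubric, and threshold are
    # assumptions made up for this sketch.
    import subprocess

    def tier1_machine_check(test_command: list[str]) -> bool:
        """Tier 1: machine-checkable. The tests pass or they don't."""
        return subprocess.run(test_command, capture_output=True).returncode == 0

    def tier2_expert_check(artifact: str, rubric: list[str], score_fn) -> bool:
        """Tier 2: expert-checkable against clear criteria.

        score_fn(artifact, criterion) returns a score between 0 and 1; it could
        be a human reviewer's rating or an LLM judge's answer.
        """
        scores = [score_fn(artifact, criterion) for criterion in rubric]
        return sum(scores) / len(scores) >= 0.8  # assumed passing threshold

    # Example rubric a product leader might apply to a strategy document:
    STRATEGY_RUBRIC = [
        "States a clear target customer and problem",
        "Names the trade-offs it is making",
        "Has measurable success criteria",
    ]

The transcript's product-strategy example is essentially the claim that a rubric like STRATEGY_RUBRIC exists implicitly in experienced reviewers' heads, which is what makes the work nearer to tier two than it first appears.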
I love that, because I think it's not just engineers. The story of Cursor, the story of this moment where Cursor's coding-specific harness was able to generalize, is suggesting to me that we are on the cusp of being able to assign a whole lot of work inside the organization to agentic workflows, as long as we can easily sniff-check for correctness. How much work can we sniff-check for correctness? Every single department has a lot of work in that category. Marketing has work where they can sniff-check a campaign design for correctness. Customer success has work where they can sniff-check an email template schema for correctness. And it's tempting to look at this and say, well, engineers are delegating all the work. So, what will engineers do? And first, we have to be honest. Smoothing means delegating the hard work, too. It's not just delegating the easy stuff. And I see so many organizations stuck in 2024 or 2025 thinking when they say we'll just delegate the easy stuff, the simple stuff. The best organizations are delegating the hard stuff, as long as they can actually sniff-check the work. And so the skill that survives this transition isn't "I can do the work," right? No, it's "I can sniff-check." I can tell if the work is correct or not. I can tell if this is what we should be doing. In other words, everything at work is moving to meta-skills. I use meta-skills as a word a lot. We've got to just underline it in pen here. Meta-skills like knowing whether architecture is maintainable, like recognizing a solution that is fragile, like understanding when tests cover all the important cases and you don't need more. These kinds of meta-skills get more valuable as harnesses improve, not less. So the question outside engineering is pretty simple. What does sniff-checking look like for us, for financial modelers, for legal researchers, for clinical trial designers, for product managers, for customer success folks? In each case, an evaluation competency sits above execution competency in value, and it becomes even more valuable as execution gets cheaper with agents. So, the people who develop the ability to do those sniff checks are the ones who will find themselves really, really well positioned as harnesses come for their domains. And I'm here to tell you it's coming fast. So let me come back to the four companies that independently built the same structure. Anthropic, Google, Cursor, OpenAI: they all built some version of decompose a problem, parallelize the problem for agents, verify that problem, and then keep iterating till you get it done. What that implies is that there is an underlying structure for how we solve problems at work that is now solved. It is solved by agents. It is just done. They have figured it out. If we can set up the right agentic harness, we should be able, in principle, to tackle any problem at work that we can decompose, parallelize, verify, and iterate on. Any problem at all. We are not limited. The surface is smooth. I want to be very clear at this point. The relevant question for you and for me and for anyone doing knowledge work right now is shifting very quickly from "can AI do a specific task in my job family," which I hear all the time, to "can my work be decomposed into verifiable subproblems," which is a mouthful but is much more relevant. And I'm here to tell you that the answer is yes, far more often than most of us are comfortable with. And I think we're uncomfortable because it requires us to level up. It requires us to think about how we evaluate. It requires us to do a sniff check.
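One way to operationalize the "can my work be decomposed into verifiable subproblems?" question is a simple checklist like the sketch below; the fields and wording are assumptions for illustration, not a standard framework.

    # Illustrative "can I delegate this?" checklist, mirroring the
    # decompose / parallelize / verify / iterate test. Purely hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        decomposable: bool       # can it be split into subproblems?
        verifiable: bool         # can I (or a machine) sniff-check the output?
        independent_parts: bool  # can the subproblems run in parallel?

    def delegation_readiness(task: Task) -> str:
        if task.decomposable and task.verifiable:
            return "good candidate for an agentic workflow"
        if task.verifiable:
            return "delegate whole, then sniff-check the result"
        return "keep human-led until you can define a check for correctness"

    print(delegation_readiness(Task("quarterly campaign plan", True, True, True)))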
The organizations that figure out how to make this a productive shift, how to structure teams so that they think about delegation to agents, know what tasks they can delegate, and have support for that, are going to have to work very differently. We're going to have to have much more talent thinking about agent infrastructure and building it. We're going to have to have much more talent and training around how we think about taste, understanding if something is correct, and decomposing problems. There's this massive migration to do. That's going to take a lot of people. Did you notice that I said a lot of people? It is not actually easy to install this stuff. It's hard. And I know that the easy thing to do is to say, "Wow, you know, Cursor built a browser. What are we going to do?" The leverage that these agents provide is leverage that we can extend our impact with. It's also a challenge to how we do our daily work. And the latter is the thing that I find is most scary for folks: we cannot continue our current habits. And I cannot promise you that you can. Anywhere in the workplace today, your work is going to have to change. And the only thing you can control is whether you understand that and get ahead of it by being proactive, starting to map out your domain and asking: what can I delegate? How can I be a sniff-checker, a tastemaker, an agent infrastructure builder? How can I shift into a mode where I am bringing the agents into the space on my terms to extend my leverage? Or you're just going to sit there passively and it's going to happen anyway, and that's a much worse place to be. The lesson I want to call out here is that AI is smooth for work. Capabilities are not jagged at work. That is a misunderstanding. The reason why is really cool. The reason why is that we essentially scaled all of our organizational intelligence learning tools for people into AI over the last 2 or 3 years. And it finally hit a tipping point. And now it has basically solved anything at work that we can decompose, that we can parallelize, that we can verify, and that we can iterate on. And that turns out to be a ton of work. That is a huge piece of the story. I think it is much more important than whatever scoreboard score drops tomorrow. Think about that. Remember that the world is not as jagged as it seems. And uh, don't worry if you can't solve an Olympiad math problem. That is not what our future depends on.

Video description

My site: https://natebjones.com
Full story w/ prompts: https://natesnewsletter.substack.com/p/cursors-coding-agents-solved-a-math?r=1z4sm5&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

What's really happening with AI capabilities at work — and why the "jagged AI" frame is now obsolete? The common story is that AI is brilliant at some things and broken at others — but the reality is that jaggedness was never about intelligence; it was about how we were deploying it. In this video, I share the inside scoop on why AI agents in proper harnesses are smoothing the capability frontier for real work:
- Why the jagged AI frontier was always a deployment problem
- How multi-agent coordination unlocks long-horizon knowledge work
- What Cursor's math breakthrough reveals about AI generalization
- Where meta-skills like sniff-checking become your competitive edge

The organizations and individuals who learn to decompose work, delegate to AI agents, and verify outputs will extend their leverage — those who don't will find the shift happening to them anyway.

Chapters
0:00 The jagged AI assumption we all got wrong
1:45 Why single-turn AI mimics a 30-second analyst with no notes
3:20 Inference computing and the first signs of smoothing
5:10 Why jaggedness was never about intelligence
7:00 The Cursor math breakthrough that changed everything
10:15 How the planner-worker-judge architecture works
13:00 Four companies independently built the same structure
15:30 The real cost of multi-agent systems (and why it's worth it)
17:45 Two tiers of domain verifiability at work
20:30 What sniff-checking means for every knowledge worker
23:00 Meta-skills are the new execution skills
25:10 What this means for your organization right now
27:00 The surface is smooth — what to do about it

Subscribe for daily AI strategy and news. For deeper playbooks and analysis: https://natesnewsletter.substack.com/

Listen to this video as a podcast.
- Spotify: https://open.spotify.com/show/0gkFdjd1wptEKJKLu9LbZ4
- Apple Podcasts: https://podcasts.apple.com/us/podcast/ai-news-strategy-daily-with-nate-b-jones/id1877109372

© 2026 GrayBeam Technology · v0.1.0 · ac93850 · 2026-04-03 22:43 UTC