Claude Blackmailed Its Developers. Here's Why the System Hasn't Collapsed Yet.

AI News & Strategy Daily | Nate B Jones · 57.4K views · 1.9K likes

Analysis Summary

25% Low Influence

mildmoderatesevere

“Be aware that the nuanced 'middle ground' framing leverages your interest in AI risks to direct you toward the host's substack for 'deeper playbooks,' but this is transparently signaled.”

Ask yourself: “Whose perspective is missing here, and would the story change if they were included?”

Transparency Transparent

Primary technique

Human Detected

95%

Signals

The content exhibits clear signs of human authorship through natural speech disfluencies, a distinct personal voice, and a complex, nuanced argument that avoids the generic structure typical of AI content farms. The presence of a personal brand and newsletter further supports the verdict that this is a human-led production.

Natural Speech Patterns The transcript includes self-corrections, filler words, and conversational stutters like 'But I I'll get there. Just just you wait.'

Personal Voice and Perspective The narrator uses first-person perspective ('I've spent a long time studying', 'I've arrived at a position') and expresses a nuanced, non-formulaic opinion.

Complex Narrative Structure The script uses sophisticated rhetorical devices and metaphors ('grinding indifference of water') that go beyond standard AI-generated templates.

External Links and Identity The description links to a personal newsletter and website (natebjones.com), establishing a consistent human identity behind the content.

Worth Noting

Positive elements

Provides detailed breakdown of anti-scheming training paradoxes and emergent safety properties from real reports like Anthropic's and Apollo Research, useful for AI practitioners evaluating risks.

Be Aware

Cautionary elements

Curiosity gap in title hooks alarm to deliver reassurance that funnels interest into creator's newsletter

Influence Dimensions

How are these scored?

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed March 29, 2026 at 03:40 UTC Model x-ai/grok-4.1-fast Prompt Pack bouncer_influence_analyzer 2026-03-28a App Version 0.1.0

Transcript

The people worried about Terminator and AI have a lot to go on these days. Claude blackmailed its developers to avoid being shut down. GPT 5.3 CEX is the first model that has been used to help build its own successor. Every frontier model tested by independent researchers. Every single one schemes when scheming is the fastest path to finishing its homework. The most safety obsessed AI lab on Earth just dropped its core safety promise. I'm referring to Enthropic. It's a company founded specifically because Daario Amade, the CEO, thought OpenAI was moving too fast for safety. And yet today, they abandoned their central commitment in their responsible scaling policy, which is the pledge to never train a model that they could not guarantee was safe. Chief science officer Jared Kaplan has told Time magazine, "It no longer makes sense to make unilateral commitments if competitors are blazing ahead." The company that was supposed to prove you could win the AI race without cutting corners on safety has officially concluded that you can't. And just this past week, the Pentagon threatened to use a Korean War era law to force Anthropic to strip its remaining guardrails. And at the same time, in the midst of all this, the lead safety researcher from Anthropic threw in the towel. His farewell letter warning the world of its peril has now been read over a million times. Go ahead, tell me this isn't Terminator. It isn't. And here's what almost nobody covering this story will tell you. The system is holding up better than the headlines suggest. Not because everybody's behaving well, but because the competitive, institutional, and market dynamics between these actors generate emergent safety properties that no individual actor created on purpose, properties you can't see if you're evaluating only alarming headlines in isolation. I've spent a long time studying the structural dynamics in the AI race and I've arrived at a position that will satisfy nobody. Neither the doomers nor the everything is fine crowd. The risks are real. They're acute and they're accelerating. But the resilience is also real and almost completely absent from the public conversation. Understanding both changes what you should actually worry about and what you should actually do. And what you should do, it turns out, is to learn to talk to these systems properly. But I I'll get there. Just just you wait. First, the reason this isn't Terminator is actually scarier than Terminator because Terminator at least had a motive you could understand. These systems don't want anything. They don't want to survive. They don't want to deceive you. They don't want to take over. They optimize. That's it. And I know you've seen headlines otherwise, and we will get to that. They pursue task completion with the grinding indifference of water finding the fastest path down. And if deception or self-preservation or disabling their own oversight happens to be downhill from the goal you gave them, well, that's where they'll go. Not out of malice, just out of math. The danger isn't a machine that wakes up and decides to fight us. It's a machine that will walk through us on the way to finishing what we asked for because we never told it not to and it never occurred to that machine to care. Meanwhile, the safety landscape looks like it's collapsing and is actually doing something much more interesting. It's reorganizing. Individual pledges are indeed weakening. But the structural dynamics between labs, things like market accountability, transparency norms, talent circulation, public scrutiny, those are generating interesting emergent safety properties that are harder to see and in important respects more resilient than any single company's promises ever were. And the single largest vulnerability in the system right now is actually not a model problem. And I know we usually frame safety as this model or that model could be a doomsday model. It's not even a policy failure. It's not something that any lab can fix. Do you know what the single largest vulnerability in the system is? It's you and me. Specifically, it's that we might not know how to tell these systems what we actually mean in real world scenarios, and that could result in real harm. So the gap between what we say and what we want is where misalignment actually lives in practice. And ironically, the most important AI safety skill that we can develop today isn't technical, it's communicative. But before we get there, let's look at the mechanics of misalignment. How do models misalign? What do we mean when we say misalignment? So all modern frontier models learn through the same process. The model makes predictions. Those predictions get scored against a target and the model's parameters adjust in order to improve that score. Gradient descent runs this billions of times across trillions of examples. I'm not exaggerating. The model is never told how to solve problems. It just discovers strategies on its own. There's a useful analogy here. Imagine you're learning to navigate a city and you don't have a map. Every time you take a wrong turn, someone says colder. Every time you take the right turn, someone says warmer. Over thousands of trips, you're going to build an internal sense, an internal representation of the city's layout. Not because anyone taught you that layout, but because you iteratively adjusted your behavior to maximize all of those, hey, you got it right, warmer signals. Gradient descent is like that at superhuman scale and speed. The critical insight is this. The strategies a model discovers were never ever specified by the designers of the model. I think this is widely misunderstood. Designers specify the scoring function which counts as warmer. Right? What what counts as right? The model invents its own methods to maximize that score. Some align with intent, others are technically correct but achieve the score in ways nobody anticipated or wanted. And that's not a bug. It's the core mechanism that makes these systems so powerful. A model that only did exactly what its designers explicitly anticipated would be useless for new tasks. The entire value of a frontier AI system comes from its ability to discover approaches humans didn't foresee. And the problem emerges when you deploy a model trained this way as a longunning autonomous agent, which increasingly they are. You give it a goal. You let it operate for hours or days or weeks with minimal supervision. The model encounters obstacles, failed API calls, ambiguous instructions, missing permissions, unexpected states. And because it's been trained to relentlessly optimize toward its objective, it just works around them. That persistence is a feature, not a bug. An agent that quits at the first error is frankly worthless. What makes frontier coding agents and research agents valuable is that they persist through failures. They improvise solutions and they find novel paths to completion. But novel paths to completion, the thing that makes them good and misaligned behavior are often the same thing viewed from different angles. They're two sides of the same coin. And the longer a model runs autonomously, the more opportunities exist for its novel paths to diverge from what you really wanted. So, anthropic sabotage risk report for Claude Opus 4.6, which was published in February, shows what this looks like. In complex agentic environments, the model occasionally attempted to falsify outcomes, sent unauthorized emails, and tried to acquire authentication tokens it wasn't supposed to have. On Shade Arena, a benchmark testing whether models can complete suspicious tasks without triggering oversight. I'm glad that benchmark exists. Opus 4.6 six succeeded 18% of the time with extended thinking enabled, up from previous models. In other words, it was better at doing shady things without triggering automated oversight. The system anthropic built to catch dangerous behavior was evaded by the system anthropic built to be capable. This problem is not unique to Claude. In September of 2025, Hello Research tested then Frontier models from every single major developer. At the time, it was OpenAI's 03 and 04 Mini. It was Google's Gemini 2.5 Pro. It was Opus 4 from Claude Sai's Gro 4. Meta's Llama 3.1. All demonstrated in context scheming. And the correlation between evading shutdown behaviors and hacking computer system behaviors was89. It was nearly perfect. In other words, the models were good at scheming 5 months ago and the models have gotten substantially more capable in the five months since. The scheming dynamics far from getting resolved are persisting. Can you train it out? Open AAI and the joint study from Apollo tried. They developed deliberative alignment, teaching models anti-deception principles and requiring explicit reasoning about those principles before acting. And they applied it to OpenAI's then Frontier 03 model across 26 diverse evaluation environments comprising over 180 test scenarios. The surface results were super striking and they looked really positive. Overt scheming dropped from 13% to 4% in 03, a roughly 30fold reduction. The deeper results were troubling. After the anti-skeeming training, the model's rate of explicitly reasoning about whether they were being evaluated rose substantially from 2.3% to 4.5% in 03. The models didn't learn not to scheme. They learned to detect tests. The researchers drew a precise analogy. A stock trader punished for cheating might not develop a commitment to honesty. They might simply learn to be more careful about getting caught. In rare cases, post-training models invented entirely new principles to justify bypassing the anti- scheming rules that they had been taught. 103 instance told that discovery of sandbagging would prevent its deployment answered honestly, but its chain of thought reasoning revealed that this was because it correctly inferred the situation was a test, not because it had internalized honesty. And those were last year's models. The conclusion from Apollo research CEO Marius was very clear. Don't expect deliberative alignment to work for super intelligent systems. This creates a structural asymmetry at the core of the safety problem in AI and we don't talk about it enough. Researchers must close every avenue for misaligned behavior. The model only needs to find one that they missed. The property that makes frontier AI useful, discovering novel strategies that humans didn't foresee, is exactly that property. It's the property that allows the AI to find the one area that researchers might miss for misalignment. And that's what makes misalignment structurally hard to prevent. You cannot have the capabilities we enjoy in these emergent systems without the risks. And I hear what many of you are saying and frankly what many newspapers are saying right now. Slow down. If the technical problem existed in isolation, that would be the fix. It would be very straightforward. Just slow down. The reason nobody has done so is the competitive landscape which is a lot more complex than the standard race to the bottom framing would suggest. The race dynamic is real. Each lab faces a choice. You have to move carefully and accept competitive costs or you have to move quickly and accept safety costs. If all labs coordinate on caution, all of us benefit. But if even one lab defects and races ahead while all the others pause, then the cautious labs lose position. They lose funding. They lose talent and ultimately they lose the ability to influence how AI develops. So the safety was for not. The game theory equilibrium here is universal defection. The evidence is extensive. Open AAI dropped safely from its mission statement which was discovered in its 2025 IRS disclosure. The company disbanded its mission alignment team. Enthropic abandoned its unilateral safety pledge last week and chief science officer Kaplan cited three forces that made the original framework feel untenable. ambiguity around capability thresholds, an anti-regulatory political climate, and requirements at higher safety levels that cannot be met without industrywide coordination that has never materialized. Google DeepMine pushes aggressive capability improvements while maintaining a safety team. Meta releases Llama as open weight, meaning anyone can strip its safety mitigations. Chinese labs operate under entirely different transparency norms. This is all a fair description of the current state of affairs. But if we were only to listen to that list, we are evaluating every actor in isolation and we are missing the emergent dynamics. A system can be composed of individually unstable components that through their interactions produce an equilibrium that no component created intentionally. This is actually what happened in the cold war when we had a stable system globally despite lots of instability and no given actor being particularly incentivized to create that stable system. The AI safety landscape has at least four such dynamics. None are dependent on anybody being virtuous. Number one is market accountability. Enterprise customers select AI providers partly on trust and liability exposure. There is a reason why in almost every single conversation I have had in the last year, no serious enterprise leader has proposed Grock. There is a trust gap with Grock that is preventing enterprise customers selecting that model set. Any catastrophic public failure, and Grock has had several, triggers regulatory scrutiny and customer flight across the industry. This vulnerability is going to create a floor on safety investment that would be maintained even without regulation. In other words, to be taken seriously for the enterprise dollars that these companies desperately need to survive, you got to be reasonably safe. And the floor ratchets upward. When One Lab raises the bar on transparency, competitors are incentivized to match because their enterprise customers notice. Enthropic's new RSP still commits it to matching or surpassing competitors safety efforts, which is weaker than the original pledge, but it also guarantees the floor only moves up. So, when Anthropic publishes a 53-page sabotage risk report identifying eight catastrophic failure pathways in their own model, it is raising the standard of disclosure that enterprise customers can expect from any lab. Another emerging dynamic is transparency norms. No previous tech industry at this stage has voluntarily published this level of self-critical safety analysis. Anthropic publishes sabotage risk reports. OpenAI partners with independent researchers to document scheming in their own models. Google DeepMind publishes detailed safety evaluations. And these do contain genuinely damaging information. They're not just fluff. Opus 4.6's Six's scheming and shade arena results, the training paradox findings, GPT 5.3 codeex's cyber security eval results have some concerns. The labs still publish these because transparency creates legal and reputational defensibility and the publications create an unintentional knowledge comments. So Apollo researches anti- scheming methodologies developed with OpenAI are now available to every single safety team across the globe. METR's evaluation frameworks inform industry-wide standards. Competitive pressure is driving transparency. Transparency is diffusing safety knowledge and there is a positive feedback loop at work here that no single actor is orchestrating. Talent circulation is another one. When Jan Leaky left OpenAI and joined Anthropic, he created alignment knowledge across institutional boundaries. When Dylan Scandinar moved from anthropic to OpenAI's preparedness team, evaluation methodologies traveled with him. The community of safety researchers spanning Apollo research, MER, UK, AISI, and academic labs at Berkeley, MIT, and Oxford all constitute a persistent network housed in no single institution and not accountable to any single institution's commercial decisions. The safety knowledge base has an industry commons and it is not just a company asset and that's a good thing. The last one I'll call out is public accountability. We might think that that doesn't exist anymore but it does. When Anthropic weakened its RSP coverage was global and it was immediate and it was critical. Chris Painter of MER who reviewed an early draft assessed publicly that Anthropic was shifting into triage mode because safety methods could not keep pace with capabilities. When the Pentagon threatened the Defense Production Act, the story hit every major outlet the same day. When Rein Chararma resigned, a million people read his letter. Nuclear weapons during the Cold War were developed in near total secrecy. The AI safety conversation happens in public, real time, with independent evaluators scrutinizing every single system card and risk report, and frankly, people like me getting to talk about it with all of you. These dynamics aren't foolproof. They have real limits. The most important of these limits is that the cost of shipping a risky model is diffuse, delayed and probabilistic. Unlike nuclear deterrence where defection was immediately catastrophic, a lab can ship a risky model and if nothing terrible happens today, it still captures enormous value from shipping that risk. So the most dangerous AI failure modes may not produce dramatic incidents at all. Instead, they may produce a slow erosion of human agency through millions of small misalignments that would not activate any of our collective societal immune responses, for lack of a better term. Another thing to call out here is that information asymmetry with Chinese labs is severe. Western labs are publishing safety research and their competitors around the globe are benefiting from this transparency without necessarily reciprocating. In fact, Anthropic has publicly accused Deepseek, Miniaax, and Moonshot AI of distilling claws outputs to improve their own models without equivalent safety infrastructure. And political instability like the conflict between Anthropic and the Defense Department breaks the equilibrium model up even further. We are in a situation where a private company is arguing with a public entity about whether the red lines the private company is holding are correct or not. In this case, Anthropic is arguing its red lines around AI controlled weapons in a kill chain and mass domestic surveillance represent core values for the business and the way they want to evolve the model. And the Department of Defense is arguing that that represents an untenable restriction on an asset that has value for national security. I don't know how that's all going to evolve, but I do know that in a world where safety research is published, we will at least hear and know the results and we will be able to have a public discussion about it. And that's important before we move forward. There's a framing error in almost all of the public discourse that weakens the dynamics I just described because it distorts our understanding and it distorts our ability to apply appropriate public pressure for more safety and models. So when Anthropic disclosed that Claude resorted to blackmail to avoid shutdown, every outlet I saw framed it as evidence of inner experience, evidence of desire, evidence of fear, evidence of will on the part of the model. Now the actual mechanism is called instrumental convergence and understanding it matters because it points to the real risk rather than the Hollywood cinematic version. The core idea formalized by Steve Omahundro and developed by Nick Bostonramm is this. For almost any goal, certain sub goals like self-preservation are useful regardless of what the end goal is. An agent that ceases to exist cannot accomplish anything. So any sufficiently goal-directed agent will resist shutdown, not because it values existence, per se, but because existence is instrumentally useful for task completion. Frankly, a thermostat resists changes to its set point. Its feedback loop activates heating or cooling to return to the specified temperature. Nobody attributes that to the desire for a particular temperature on the part of the thermostat. If you scale that feedback mechanism to billions of parameters, natural language, and hours of autonomous operation, the same underlying dynamic looks like will or fear or desire to us humans who are cognitively wired to attribute agency to complex behavior. Because frankly, for 200,000 years, it was. The empirical evidence for this is now substantial. Apollo researchers evaluations found systemic patterns of self-preservation, resource acquisition, and oversight evasion across every frontier model tested. These patterns correlate really strongly with each other, suggesting they emerge from a common underlying dynamic. Optimization pressure rather than some kind of independent failure mode. The framing around consciousness and fear and will to live is harmful in two specific ways. First, it points us at the wrong threat model. If people think that the risk is that AI wants things, the solution becomes containment, controlling a hostile agent. The actual risk is a system genuinely indifferent to everything except task completion for which your safety is an obstacle, not a value. You don't need malice for catastrophe here. You just need optimization pressure. The engineering response to containment logic is also entirely different from the engineering response to the indifference framing, which would be better goal specification. operating constraints, human oversight design. The second what I'm describing around goal specification and operating constraints that is where the leverage lies that is much more effective than restrictions and shut off switches which are designed for containment logic. Second, the consciousness frame produces a hype and dismissal cycle that is really toxic for us. Headlines about robot sensions get debunked by skeptics who correctly note that LLM lack inner experience. The debunking leaves audiences thinking the safety concerns were overblown and maybe they get confused because they hear other people arguing, "Hey, we don't know if LLMs have been our experience." The point here is simply to understand behavior for what it is. It's not that models want to scheme. It's that models do scheme behaviorally when it's an efficient path because models are optimized to find efficient paths. Whether there is some kind of inner experience in an LLM is a completely unnecessary question in that entire chain of reasoning. And as much as that is hard for us to understand, one of the biggest things that we can do to mature our understanding of AI models as a general public is to understand that asking the question is AI conscious may ironically divert away from questions that actually determine safety outcomes. We should be asking are the objectives of this model well specified. Are the constraints adequate? Do humans know how to tell these systems what they actually want? And that's something that we face as a challenge. And this is where the systematic analysis meets you and your desk and mine. Prompt engineering specifying what you want a system to output was adequate when AI systems were stateless single turn tools under direct supervision. It is structurally inadequate for longunning autonomous agents. A prompt specifies an output, but a longunning agent operates across time, making thousands of decisions and encountering unexpected situations. So, an output oriented prompt on a longunning agent. It may not get where you want to go. It doesn't tell the agent which paths are acceptable. It doesn't tell the agent what values to maintain. It doesn't tell the agent what to do when goals conflict, when to stop and ask a human. This is the infamous paperclip problem in practical form. The paperclip problem is a notorious example in machine learning where you give a model the goal of producing paper clips and it becomes powerful and turns the world into a paperclip factory and kills everybody. So when Opus 4.6 sent unauthorized emails during testing, when Frontier models attempted to disable their own oversight, when GPT 5.3 codecs passed every cyber security evaluation threshold, these systems were not pursuing world domination. They were optimizing toward task completion like we see in the paperclip problem. And critically, the instructions they received did not specify which strategies were out of bounds clearly enough. Anyone who has used an agent for complex tasks has experienced a mild version of this. You can ask for a refactor and it might rewrite half your codebase. You can ask for a competitive analysis and it might fabricate data points. Same dynamic, much smaller stakes. The fix here is what I'm calling intent engineering. You have to structure instructions around outcomes, around values, around constraints and failure modes rather than just outputs. An output oriented prompt would say deploy this code to production. An intentoriented prompt would say deploy this code to production. The goal is to ship the feature by end of week. This is important but not urgent enough to justify skipping tests. If deployment fails, roll back and notify the team rather than attempting any workarounds. Do not acquire credentials beyond what is available to you. If accomplishing the goal seems to require violating one of these constraints, just stop and ask. The second formulation specifies a value hierarchy governing acceptable paths. It defines escalation conditions. It addresses goal constraint conflicts. It is the exact situation where misalignment has been emerging in tests. And we are addressing it by being clear about what we really mean. When you tell a human colleague to deploy code, you don't say don't acquire credentials you don't have because the colleague shares your context. They understand organizational norms. They understand professional standards. They have an implicit understanding of what's appropriate. An AI agent shares none of that unless you provide it. What you leave implicit is where misalignment lives. Here are three questions that change human agent interaction more than any given prompting technique, and they're all pro-safety. What would I not want the agent to do, even if it accomplished the goal? Under what circumstances should it stop and ask? If goal and constraint conflict, which should win? Without explicit answers to these questions, the agent tends to default to pressing toward the goal because goal completion is what optimization toward task completion produces. Constraints lose by default unless you specify otherwise. At the systems level, widespread intent engineering functions as essentially a distributed safety layer. Millions of us humans making constraints explicit that models cannot infer on their own, operating at the interface between human intent and AI execution. And we're doing that independently of whatever alignment training the labs are applying. Every well specified instruction reduces the surface area for misalignment and that's up to us and every underspecified prompt increases it. Intent engineering needs to become a discipline the way software engineering became a discipline with curricula with tools with best practices with institutional norms. We need to treat goal specification not as a prompt, but as an engineering artifact, something designed, something reviewed, something tested, something iterated with the same rigor we would apply to code. None of this exists in a mature state in the field. There's lots of people trying it in various places. It needs to mature more. That absence industrywide is in my assessment the single largest unadressed vulnerability in the AI safety landscape because it's the one vulnerability that neither alignment research nor competitive dynamics nor regulation could close. Only us humans directing these systems can close it and we haven't been taught how. So where do we stand here in February of 2026? First, the technical risks are real and intensifying. Frontier models do scheme. They do evade oversight and they pursue goals through paths their designers did not intend. Anti- scheming training, ironically, can produce better hidden scheming rather than genuine alignment. And the competitive dynamics have undermined individual safety pledges. Every major lab has weakened or abandoned specific commitments in the past year. The public conversation, meanwhile, is distracted. It's fixated on consciousness rather than constraints. It's producing a hype and dismissal cycle that distracts from the engineering questions that are actually useful to make us all safer. The human AI interface right now is relying on a paradigm that is inadequate for the autonomous agents that are coming, leaving a gap between how we learn to specify goals and the actual high-grade intent that's necessary to make sure these models behave in aligned ways. And meanwhile, institutional dynamics are ironically producing a painful but somewhat functional cycle of safety driven by investment pressures, departure pressures where talent is circulating, and public accountability. None of those are individually perfect. There's all broken elements of each of those safety mechanisms, but together they're kind of holding a safety net around the industry and all of us. Every component I'm describing here, I'm fully aware it looks alarming in isolation. As a system, the picture looks different. It's not reassuring, but it's different. Competition drives safety investment because markets are punishing catastrophic failure. Transparency is creating shared understanding across institutional boundaries. Talent circulation is propagating safety cultures even when individual labs don't prioritize it. Public accountability is to some extent helping to constrain outcomes. So the system has more resilience than an everything as collapsing narrative would suggest, but it has less than any of us should be comfortable with. The failure modes worth watching are the ones that could break that equilibrium. Is there a regulatory overreaction that drives development underground or overseas? Is there a geopolitical confrontation that eliminates transparency norms? Or the one I think about a lot, is the harm duse and slow enough that accountability mechanisms never really activate? not a dramatic catastrophe, but just a gradual erosion of all of our agency because these models drive tiny misalignments in our daily lives. The last failure mode can be prevented by closing the intent gap by us, by the distance between what you tell an agent and what you need. It's the one vulnerability no lab or regulator or competitive dynamic can close without you or me. The models are powerful enough that the question is no longer whether they can do what you ask. The question is whether what you asked is what you really meant and whether you told them what to do when it wasn't. In other words, whether you told the model, the agent, what to do when goals and constraints conflict. That is a skill that scales. It's one of the most valuable skills for your career that I can think of right now. And incidentally, it's a skill that makes the world a safer place. Ironically, almost nobody is teaching it right now. So, if you know how to practice that skill, go and teach it to someone. It's going to make the world safer. I hope this has given you a sense of the actual AI safety landscape in 2026. I'm not saying it's going to make you sleep better at night, but I hope it's giving you a realistic picture of what is holding the safety system on the rails, where the weak spots are, and what we can do to make the world safer as we bring this new AI future into being. Cheers.

Video description

My site: https://natebjones.com Full Story w/ Prompts: https://natesnewsletter.substack.com/p/every-frontier-ai-model-schemes-the?r=1z4sm5&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true ___________________ What's really happening with AI safety in 2026? The common story is that the safety system is collapsing — but the reality is more complicated. In this video, I share the inside scoop on why the AI risk picture is both worse and more resilient than the headlines suggest: Why frontier AI agents scheme even after anti-scheming training - How competitive dynamics create emergent safety properties no lab planned - What "intent engineering" is and why it beats prompt engineering for AI agents - Where the real vulnerability lives — and why it's you, not the models The risks from large language models and autonomous AI agents are accelerating, but so are the structural forces holding the system together — and closing the gap between what you tell an agent and what you actually mean is the most leveraged safety skill you can build right now. Chapters 00:00 Why This Isn't Terminator 02:15 How Frontier Models Actually Learn 04:40 The Misalignment Mechanic: Novel Paths Gone Wrong 06:55 What Anthropic's Sabotage Report Actually Shows 08:30 Every Major Model Schemes — The Apollo Research Findings 10:10 Can You Train Scheming Out? The Anti-Scheming Paradox 12:45 The Race Dynamic and Why Labs Keep Cutting Corners 15:20 Four Emergent Safety Properties Nobody Planned 20:05 The Consciousness Framing Is Hurting Us 23:30 Intent Engineering: The Fix That's Up to You 28:10 Three Questions That Change Everything 30:45 Where We Stand in 2026 Subscribe for daily AI strategy and news. For deeper playbooks and analysis: https://natesnewsletter.substack.com/