Analysis Summary
Performed authenticity
The deliberate construction of "realness" — confessional tone, casual filming, strategic vulnerability — designed to lower your guard. When someone appears unpolished and honest, you evaluate their claims less critically. The spontaneity is rehearsed.
Goffman's dramaturgy (1959); Audrezet et al. (2020) on performed authenticity
Worth Noting
Positive elements
- This video provides a rare, detailed look at the physical layer-1 networking and hardware configuration challenges that actually delay AI deployments.
Be Aware
Cautionary elements
- The use of 'revelation framing' makes standard enterprise scaling hurdles sound like a unique industry secret that only the guest can decode.
Influence Dimensions
About this analysis
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Transcript
Enterprises are racing to build AI infrastructure. We are talking billions of dollars in GPU clusters, massive data pipelines, and bare metal systems that need to be up and running yesterday. But here is the challenge: these AI factories are some of the most complex infrastructure deployments we have ever seen. There are multiple GPUs per server, dense networking, and one misconfiguration can stall a multi-million dollar project for weeks. So how do you get AI infrastructure right the first time, at scale, without the pain? That is exactly what we are going to talk about in this episode of Agentic Enterprise with Rob Hirschfeld, CEO and co-founder of RackN. Rob, it's great to have you back on the show.

>> It's a pleasure, as always.

>> Let's start with a term that is getting a lot of attention: AI factory. What exactly is an AI factory, and why is everyone talking about it?

>> "AI factory" is used in a couple of different ways. Generally, what we see it meaning is the idea that you want a standard footprint to do AI inference and training, probably more training than inference. The idea is that you're creating a known quantity you can reproduce, one that generates AI workloads and automation. That's fundamentally the unit we're thinking of. So instead of "AI cluster" or "AI workstation" or "AI workload," the going term seems to be "AI factory."

>> So does it encapsulate all AI-related operations like a factory floor? You've got the supply chain, raw materials coming in, the production line, and then the output. Is that how we should think about it?

>> You're right, and there's a challenge there, because we are so close to the bare metal and to how people look at buying servers. When people talk about an AI factory, they very much mean the physical plant that creates AI, which is where the factory analogy comes in.
It is a little confusing, because it's not an assembly line making widgets, but it is the equipment you buy and the location you have that produces AI results. If you're a model builder, the AI factory is racks of training gear. If you're an inference system, it's racks of inference engines. All the OEMs we talk to, and all the AI builders, are thinking in these terms: "I'm building an AI factory." It could be thousands of machines, and often is, but that's the idea of what they're building. What we have seen, though, is that it often ends up being one cluster at a time. Somebody might build an AI training cluster and think of it as an AI factory; when they've got ten clusters, they might think of those as ten AI factories, each producing output from the system. It's useful as a frame of reference because it describes the mechanical, physical plant you need to accomplish AI, versus running a model, doing inference, or having an API. Those are the differences we see when we're talking to somebody who actually wants to be running an AI system as opposed to using one.

>> Is "AI factory" a well-known term, or is it being coined right now? Sometimes we create terms to solve specific problems or build better understanding, like DevSecOps, DevOps, SRE. Can you talk about the nuance behind this term?

>> Coining a term is an interesting thing. When we first started doing edge, people spent a lot of misplaced time defining what "edge" meant, and cloud was the same: there was pushback on the idea of calling it "cloud" initially. So this is not something RackN is trying to coin.
AI factories are a request we get over and over again. When we talk to hardware OEMs, to partners, to customers who are building, they are using the term "AI factory." So it is an emergent term for AI hardware and infrastructure, and for what that build looks like, which I think is important because it's different from saying "AI cluster." That's typically not as clear as somebody saying, "I'm actually buying the gear. I'm actually running the infrastructure."

>> When you look at organizations building AI factories, NVIDIA, Tesla, companies ingesting massive amounts of data in real time, these teams are rushing to get infrastructure up and running. And we are talking bare metal here, not virtualized systems. Walk us through the pressure these teams are facing, what resources are available, and of course what the play is here for RackN.

>> Yeah, there are two things you said that I really want to highlight. One is data: we all know data is absolutely critical for AI training, and the performance of the systems that ingest data is part of the reason these are AI factories. It's not as simple as taking a traditional data storage system or a data lake, hooking it up to a switch, and attaching servers to it. The AI factories we're talking about are highly tuned systems, all put together to accomplish a specific task. So the industry is building AI factories to suit. What RackN is doing, and this gets to the second part of your question about urgency, is using our automation to drive the delivery of this infrastructure in a ready-to-go state. It starts with what we usually call at RackN a layer-zero design.
An architect designs the networking topology, the SmartNICs, the machines, the GPUs; all of that layout is done. And even though that's done, and the teams above, the AI ops and platform teams, know what they're going to lay down, what models they're going to run, and how they're going to do it, what we're finding is that these are incredibly complex systems. They're very expensive, and the need to get them into market and in use is incredibly urgent. Our customers building AI factories take delivery of racks without even knowing in advance what's going to show up when. They just get them as fast as they can possibly get them, they rack them, and they need them running yesterday. These are incredibly expensive systems, and we've talked about this a bit already, but it's not just the cost: a quarter-million dollars per server, millions of dollars per rack, billions across a deployment. It's that the opportunity cost of not having these clusters running is significant. So the automation necessary, and this is where RackN's specialty comes in, is layer-one automation: qualify the system inventory, detect issues, patch and update, set the networking topology correctly, and get all the pieces of security and the operating systems installed so the machines can join the AI clusters. All of that work has to be done, and it's incredibly detailed work. There's a lot of expertise in that network topology: multiple NICs, how everything gets wired together, how SmartNICs are configured. The precision needed to reliably automate this gear is something Digital Rebar was purpose-built for, but our customers don't have the tooling or the expertise to just walk into a data center and have these racks up and running.
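The layer-one sequence Rob lists here, qualify inventory, detect issues, patch, set topology, install the OS, is essentially an ordered pipeline with a hard stop on failure. A minimal sketch of that shape in Python; the stage names and checks are illustrative assumptions, not Digital Rebar's actual API:

```python
# Hypothetical layer-one provisioning pipeline as described in the interview:
# each stage must succeed before the next runs, and a failure halts the server
# rather than letting a misconfigured node join the cluster.

def qualify_inventory(server):
    # Verify the delivered hardware matches the layer-zero design.
    return server.get("gpus", 0) >= 8 and server.get("nics", 0) >= 2

def patch_firmware(server):
    server["firmware"] = "current"
    return True

def configure_network(server):
    # Apply the planned topology (SmartNIC and switch-port settings).
    server["topology"] = "applied"
    return True

def install_os(server):
    server["os"] = "installed"
    return True

PIPELINE = [qualify_inventory, patch_firmware, configure_network, install_os]

def provision(server):
    """Run every stage in order; stop at the first failure."""
    for stage in PIPELINE:
        if not stage(server):
            return f"failed at {stage.__name__}"
    return "ready"

print(provision({"gpus": 8, "nics": 4}))   # ready
print(provision({"gpus": 4, "nics": 4}))   # failed at qualify_inventory
```

The point of the hard stop is the one made above: a half-configured server that joins the cluster anyway is far more expensive to diagnose than one that is held back at the failing stage.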
The ones who've done it manually just to get their prototypes going then find that the manual effort doesn't scale, and you have to learn how to make all of this work.

>> You folks deal with bare metal, and bare metal has always been challenging: new hardware versions, software updates, things can get tricky fast. Walk us through how complex these AI systems are compared to regular data center deployments. What makes them different, and what are the additional complexities where RackN enters the picture to help organizations?

>> It's a little bit of a paradox, because at some point these are just servers. And yet if you put SmartNICs and GPUs in a system, especially multiple SmartNICs and multiple GPUs, those components are also functionally servers. And then, if you're using a neocloud, and a lot of our customers are, the neocloud delivers the systems and our customer takes over control of them. So somebody is delivering you a server that has gone through their provisioning process, and what we've seen is that sometimes the neoclouds keep control, or keep a boot-provisioning component running in the system. All they need to do is deliver the system to you; they're not worried about any other configuration. Our team spent over a week troubleshooting the fact that each NIC on the system was itself a server, and those servers were grabbing DHCP, responding to IP addresses, and interacting with the overall provisioning network and the topology of the system independently of each other. It's something we had actually diagnosed very quickly when we walked in, but somebody said, "No, no, no, it can't possibly be that these SmartNICs are still engaged, still responding."
So we went down a different diagnostic path, and we had to come back and re-troubleshoot to figure out exactly what was going on, because people insisted they knew all of the components. When you have this many pieces and parts in a server, you have a lot more potential conflicts, a lot more places for configuration, and a lot more places for conflicting configuration. Components somebody decided weren't important and ignored earlier in the delivery pipeline start interacting when the system gets turned on, and that interaction causes problems. This is true for out-of-band management and being able to reassign credentials. This is true for how PXE boots are handled; secure PXE booting is very specific. These challenges are also enterprise challenges. The difference here is that you are much more likely to hit them, because you have so much more going on in the server and you're in a big rush. And that's one of the challenges: the customers we talk to want these servers on yesterday. They're trying to meet very strict deadlines, they're paying a lot of money for systems, and there's a tremendous amount of pressure. Human-wise, that urgency to make things happen creates its own process problems. There's a Marine Corps saying: slow is smooth, smooth is fast. What we see over and over again is that the pressure, the deadlines, and the urgency to get these systems up and running actually undermine teams' ability to troubleshoot, work stepwise, and put in the systems and processes that would then speed them up. Because it's not just about whether you can deliver the system today.
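The SmartNIC incident above boils down to one chassis quietly holding several DHCP leases, because each SmartNIC was acting as its own server on the provisioning network. A hedged sketch of the kind of lease audit that surfaces this, assuming you have a MAC-to-chassis mapping from your inventory; all names and data are hypothetical:

```python
# Hypothetical audit for the problem described above: group DHCP leases by
# chassis and flag any chassis holding more leases than expected. A chassis
# with extra leases likely has a SmartNIC still running its own DHCP client.

from collections import defaultdict

def find_extra_leases(leases, mac_to_chassis, expected=1):
    """leases: list of (mac, ip) pairs; returns chassis with too many leases."""
    per_chassis = defaultdict(list)
    for mac, ip in leases:
        per_chassis[mac_to_chassis.get(mac, "unknown")].append(ip)
    return {c: ips for c, ips in per_chassis.items() if len(ips) > expected}

leases = [("aa:01", "10.0.0.5"), ("aa:02", "10.0.0.6"), ("bb:01", "10.0.0.7")]
mac_to_chassis = {"aa:01": "node-1", "aa:02": "node-1", "bb:01": "node-2"}

print(find_extra_leases(leases, mac_to_chassis))
# node-1 holds two leases: its host NIC plus a SmartNIC answering DHCP itself
```

This is exactly the kind of check that is cheap to automate and painful to discover by hand a week into a stalled deployment.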
Every customer we've seen doing AI workloads, and frankly everybody we've seen doing virtualization workloads, benefits from being able to apply, reset, patch, and return systems like a race car pit crew. You can run your systems as fast as you want, but they have to come in for the pit. They have to get updated. If your pit crew doesn't do a good job updating, patching, and rotating the tires, you're going to have times when a whole cluster goes out because a server isn't patched and updated and you haven't practiced those routines or established a repeatable process for them. We've watched customers really resist building those kinds of processes. The ones who resist take months to get things going; the ones who lean in and make it happen are the ones who actually win in the end, because they've done the work ahead of time to have a repeatable process.

>> When you work with organizations, have you seen self-inflicted wounds? What are some mistakes teams make that are so easy to avoid? Are they bringing past practices that don't apply anymore, or building new wrong practices that should not exist in the first place?

>> A lesson I hope people take more broadly as they do AI work: you want to be able to recreate the whole system from the sources, over and over again. This is infrastructure as code, and it's the same principle. You don't want to build everything and then hope you never have to change it. The goal is a repeatable way to recreate the whole system and the whole process, and spending the time to do that and to iterate through it translates into velocity in the end.
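The "recreate the whole system from the sources" principle can be stated as a simple drill: rebuild twice from the same declared inputs and verify the results match. A toy sketch of that drill, with entirely illustrative data structures standing in for real provisioning state:

```python
# Hypothetical repeatability drill from the conversation: if two rebuilds from
# identical declarative sources diverge, the process isn't repeatable and a
# working environment was luck, not process.

def build_from_source(source):
    """Stand up an environment purely from declarative inputs (toy model)."""
    return {"os": source["os"], "topology": sorted(source["nodes"])}

def is_repeatable(source):
    first = build_from_source(source)
    # Tear down (discard all state) and rebuild from the same sources.
    second = build_from_source(source)
    return first == second

source = {"os": "ubuntu-24.04", "nodes": ["gpu-1", "gpu-2"]}
print(is_repeatable(source))  # True: identical rebuilds from the same inputs
```

In a real deployment "build" means provisioning actual racks, so the drill is expensive; that cost is exactly why teams skip it, and exactly why Rob argues they shouldn't.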
It's one of those challenges. The other thing worth mentioning is that we have really deep, bespoke knowledge about how bare metal infrastructure works. That knowledge is coded into Digital Rebar, but it's also something we can bring when we show up for a customer: get things going, get them over hurdles, move them past challenging configuration scenarios that might otherwise have taken weeks to unravel. Our team, because of our unique experience, can look past all that and move straight to the root cause. It's an aspect of us being the world experts in bare metal at this point. One of the mistakes we see over and over again is that teams forget there are knowledge layers in the organization, and those layers sit in silos. The platform team, the people running the model, trying to run Kubernetes, doing AI ops, assume that because they're using the models, they'll be able to set up the infrastructure too. And this isn't just an AI problem; we see it in the enterprise quite a bit with Kubernetes running on bare metal for AI. A lot of virtualization pilots and AI pilots stall when the team trying to run them, the AI ops team or the virtualization admin team, doesn't have the bare metal expertise. They brute-force their way through, doing a whole bunch of manual work and learning bare metal at the same time they're trying to get their pilot running, because they just want to make the pilot work and be successful. But we've seen a lot of derailed pilots that take months and months to get started, or to complete, before the team can actually demonstrate the platform on top.
And that's really frustrating, because it means management's confidence in the team's ability to execute is off. A lot of companies are trying to get off VMware, and those projects have taken much longer than they want because people no longer have the expertise to do these alternate system setups. You have to bring in the teams responsible for running your physical infrastructure, or find people like RackN who can come in, help, and advise, or bring in proven processes so you don't have to reinvent them. It's very hard to shortcut those things. And what's really sad for us is that we have customers who trust us very deeply and still fall into the trap of saying, "I want to let my OpenShift team take a swing at this and figure out how to do it." And it's like: you're going to be using the ops team and Digital Rebar anyway, so why don't you just let us help you at the beginning? But they still wall off that team and try to help them succeed as an isolated thing. This isn't just one customer; I've watched this pattern repeat over and over again. So it's this problem of siloed organizations: operations teams are separate from platform teams, they don't talk well together, and that lack of communication, that lack of helping each other succeed and thinking about the long term, frankly slows projects down. We've seen it result in a lot of missed delivery objectives, a lot of VMware renewals, and a lot of stalled AI studies and applications. When people look at running their own AI infrastructure and management thinks they can't do it, it's because of problems like that. And it's completely untrue.
We have very successful companies running AI at scale and delivering systems very quickly, and they're the ones with the strongest process discipline.

>> If you were advising an enterprise about to deploy AI infrastructure, what could they do up front to streamline the process, in addition to working with RackN, of course?

>> The number one thing you can do is set aside systems if you can, though some of these systems are so expensive that you can't actually set them aside. Take the time to actually walk through what the process is. Understand what your automation targets are and how all of those pieces fit together, and do it as a team. It's going to feel slow. It's going to feel like you're involving a lot of people, studying how these systems fit together, and spending far more time understanding things. But if you don't do that, you won't be able to replicate success. That's the advice I would give: make sure that as you go through this process, you understand how you got to a successful outcome. If you don't have a way to replicate your successful outcome, then tear it back down, start over, and go through those steps again, and that can be hard if you're in a rush. If you can't replicate success, you actually don't know why you were successful. You just got lucky. That's the thing I would tell people to think about in any IT work, but especially with these AI workloads: make sure you understand how you got where you wanted to get, and be so confident in it that you can flush everything and start over.
And if you're an executive and you want to test this theory, ask your people: all right, you got it working. Reset it back to zero and show me the whole process from nothing all the way through again. If your team starts looking scared and nervous and doesn't want to do that, they don't actually know how they got to the point of success. They're giving you a demo, and the demo might be fantastic, but without understanding all the work they did to prep it, you don't have a repeatable project. You've gotten lucky in the moment. And at the pace AI is moving, luck is just not enough.

>> Let's talk specifically about how RackN is helping teams get this right. What is your approach to tackling this complexity? Things are evolving fast: MCP was launched last year, with new versions coming out; new foundations and frameworks keep appearing, along with new models and services. So how is RackN helping teams get their AI factory ready and running?

>> Yeah, we're not just stamping "AI" all over the product and calling it good, and we're not pretending that a bunch of embedded AI is going to help anybody win here. I actually think doing that would slow down the type of work we're doing. What RackN provides is out-of-the-box bare metal automation workflows. They're battle-tested, they're time-proven, and they operate at scale. We've spent the last ten years working through what it takes to make all of this function, and we've built very concrete expertise in handling these systems. We had a customer show up and say, "By the way, we're buying these new servers and they're all ARM," instead of Intel or AMD. Our systems already supported ARM.
So we were able to roll that in, and it was a non-issue from the customer's perspective. But that type of expertise and knowledge, having all of our proven workflows run in an ARM environment, is something that had to be done with deliberation and care, and tested over a long period of time. When we work with customers, what we bring is proven, at-scale workflows where we have handled the quirks: we know how to migrate between one version of Redfish and another, or mix and match OEM versions of controls with Redfish versions of controls. Blending all these components, dealing with SmartNICs in an elegant way and integrating them into the delivery of how a system works, handling the networking topology, and having very advanced image deployment capabilities that respect the number of disks and drives being loaded into these machines: every one of those components, all of those edge conditions, has to be handled in a correct, routine way. What we've built is a standardized workflow that is API-driven, performs at scale, and frankly, out of the box, manages all of the equipment that gets thrown at us. In a lot of cases, as I said earlier, our customers don't even know what gear is going to show up. If the Dell server shows up before the Supermicro, great. If that shows up before the Quanta, then whatever's racked, you inventory it and you go. You don't have time to say, "Wait, I'm not ready for the Dell gear yet; put that one on ice for a couple of weeks while we deal with something else." Nobody has time to figure this out on the fly, and fundamentally what we bring is the fact that we've already done that work. We've already established the processes.
As new gear and new situations get thrown at us, Digital Rebar is already ready to handle the variations that occur naturally in the environment, without custom code and without people having to figure it out on the fly.

>> This year is all about AI and agentic AI. What's one thing you feel enterprises still need to understand about AI infrastructure? People get overhyped; some are scared. What do you want them to know?

>> This is the other part of the AI strategy that was in your question, and I should have answered it more clearly. We see agentic AI as absolutely critical to enterprises, and what we see happening is enterprises building their own agent systems. People don't want a RackN agentic management system. What they want, and this is exactly what we're doing, is for RackN to take the time to build an MCP server, or a prompt dictionary, so that the AI knows how to use Digital Rebar effectively. You issue high-level commands to your agent, and the agent takes the correct action in Digital Rebar and gets a reliable result: deterministic output rather than stochastic pattern matching. So what we expect is that enterprises will increasingly invest in their own agentic systems. They'll rely on one of the frontier models, they'll build agent systems, and they'll need to manage them, because enterprises are going to live or die on the quality of their agents' responses and the guard rails they put around them. RackN's job is not to build an agentic system; there are already enterprises building or buying those. Our job is to make sure those agents can interact with data center infrastructure in a reliable, repeatable way, to make that as smooth as possible, and to keep building our team's expertise in helping you troubleshoot the issues that arise from agents talking to systems. Right?
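The deterministic-output idea, mapping high-level agent requests onto a small set of vetted infrastructure actions rather than letting an agent run free-form commands, can be sketched as an allowlist wrapper. The action names below are invented for illustration and are not RackN's or Digital Rebar's real command set:

```python
# Hypothetical guard rail for the agent pattern described above: the agent's
# requests pass through an explicit allowlist of pre-approved actions, so the
# infrastructure only ever executes deterministic, vetted operations.

ALLOWED_ACTIONS = {
    "reprovision": lambda node: f"reprovision queued for {node}",
    "drain":       lambda node: f"{node} drained from cluster",
}

def run_agent_action(action, node):
    """Reject anything the agent asks for that isn't explicitly vetted."""
    handler = ALLOWED_ACTIONS.get(action)
    if handler is None:
        return f"refused: '{action}' is not an approved action"
    return handler(node)

print(run_agent_action("reprovision", "gpu-07"))
print(run_agent_action("rm -rf /", "gpu-07"))  # refused: not in the allowlist
```

The allowlist is what turns a stochastic model into a safe operator: the model may phrase its request however it likes, but only named, reviewed actions can ever reach the hardware.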
There's a whole new diagnostic skill coming. My CTO and I just had this exact experience: Claude was using Digital Rebar in a way no human would, but that was perfectly natural for a machine, and it caused issues that we then had to troubleshoot. We have to be ready, and RackN is ready. I think your listeners have this challenge too. They know they're going to have agents, they're going to need to control those agents, and then they have to be able to diagnose and troubleshoot the agents' problems, steer them toward the right outcomes, provide guard rails for them, and figure out what happened when the agents take actions that, while logical, are not correct, then unwind and understand that. Those are the skills RackN is building into how our systems operate, so that we are the proven choice for making enterprises' digital workers as effective as possible.

>> Rob, thank you so much for joining me and sharing these insights on AI factory infrastructure, bare metal complexity, and how critical it is to get repeatability right from day one. As usual, I look forward to our next conversation.

>> Thank you. I appreciate the time. Thank you so much.

>> And for those watching, if you are facing similar challenges with AI infrastructure and deployment, make sure to check out RackN and its solutions. And don't forget to subscribe to TFiR, like this video, and share it with your teams. Thanks for watching.
Video description
Most enterprises racing to build AI factories are setting themselves up for failure — not because of bad intentions, but because they underestimate the brutal complexity of bare metal at scale. Rob Hirschfeld, CEO and co-founder of RackN, has seen it firsthand: multi-million dollar GPU clusters stalled for weeks over a single misconfigured NIC, platform teams brute-forcing bare metal without the right expertise, and organizations that simply don't know how they achieved success — meaning they can't replicate it. In this episode of Agentic Enterprise, Rob breaks down what an AI factory actually is, why urgency is the enemy of repeatability, the most common self-inflicted wounds teams make during deployment, and how RackN's Digital Rebar brings battle-tested bare metal automation to get AI infrastructure up and running — reliably, at scale. Read the full story at www.tfir.io #AIFactory #BareMetalAutomation #AIInfrastructure #RackN #DigitalRebar #AgenticAI #EnterpriseAI #GPUCluster #CloudNative #PlatformEngineering