bouncer

ServeTheHome · 184.8K views · 2.7K likes

Analysis Summary

30% Low Influence

“Be aware that the 'access' to exclusive labs is a structured marketing environment provided by the sponsor to showcase specific products rather than an independent industry overview.”

Transparency: Mostly Transparent
Primary technique

Performed authenticity

The deliberate construction of "realness" — confessional tone, casual filming, strategic vulnerability — designed to lower your guard. When someone appears unpolished and honest, you evaluate their claims less critically. The spontaneity is rehearsed.

Goffman's dramaturgy (1959); Audrezet et al. (2020) on performed authenticity

Human Detected
98%

Signals

The content is presented by a known human creator (ServeTheHome) with clear evidence of on-site reporting, personal sponsorship disclosures, and natural, non-synthetic speech patterns. The technical depth and specific references to past channel history confirm human authorship.

Natural Speech Patterns: Transcript contains natural filler words ('uh'), self-corrections, and conversational phrasing ('Hey guys', 'let's face it', 'right?') typical of unscripted or semi-scripted human speech.
Contextual Anecdotes: The narrator references specific personal experiences, such as being flown to Marvell's labs and previous videos made in 2021, which indicates a persistent human creator identity.
Technical Nuance and Analogies: The use of specific, non-generic analogies (InfiniBand vs Ethernet cables) to explain CXL protocols suggests deep domain expertise rather than LLM-generated generalizations.

Worth Noting

Positive elements

  • This video provides a rare physical look at CXL Type 3 memory expansion hardware and explains the technical distinction between PCIe and CXL protocols clearly.

Be Aware

Cautionary elements

  • The video conflates a technical standard (CXL) with a specific vendor's implementation (Marvell), making the vendor's unique features seem like standard industry progressions.

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed: March 23, 2026 at 20:38 UTC · Model: google/gemini-3-flash-preview-20251217 · Prompt Pack: bouncer_influence_analyzer 2026-03-08a · App Version: 0.1.0
Transcript

You might have seen recent DDR5 memory prices go absolutely crazy. And for hyperscalers, that's a big deal because they buy so much memory. At the same time, they also have zettabytes of RAM sitting in machines that are going to be taken offline, and they have a way to go and recycle that memory using CXL memory expansion devices. What we're going to do is we're going to go to the Marvell labs where they develop some of these high-end CXL devices, especially for those hyperscale customers, and see how, even though they may seem like they might be slower, they can actually speed up things like AI workloads because, let's face it, AI is the name of the game these days. Hey guys, I want to say thank you really quick to Marvell for sponsoring this video and flying us up here so we could actually go and show you this in their labs. You normally don't get access to these labs. And if you don't get access to these labs or hyperscalers, you normally don't get to see this really cool technology. So with that, let's get to it.

Now, first off, let's start talking about what CXL is and some examples. We did this back in 2021, a couple years after CXL was introduced, really explaining the fundamentals of CXL. Now, if you don't know what CXL is, it stands for Compute Express Link. CXL takes the lanes that we would normally have as PCIe lanes, and instead of running the PCIe protocol and passing those packets across the PCIe link, CXL is a different protocol that runs over those same wires. So you can think of it like the physical layer is still those PCIe lanes, but there's a different protocol running over it, almost the same way you could have a cable that would do InfiniBand or Ethernet, or you could have a fiber optic cable running a whole bunch of different types of protocols, right? All of that is just to say that the physical layer is the same, but the protocols that sit on top of it are different, which means that you need devices that can speak not just PCIe but also CXL to be able to take advantage of this.

Now, let me talk a little bit about CXL Type 1, 2, and 3 devices. Type 1 devices are really the caching devices and accelerators. You can think of that like having a NIC that goes back to the CPU's host memory over the repurposed PCIe lanes running the CXL protocol, so the accelerator can say, hey, I need some data from that CPU host memory. The Type 2 device is more of an accelerator with memory; we've been using the GPU idea since before GPUs were crazy cool in AI. The idea here is that you have an accelerator like a GPU with memory sitting on it, and you also have a CPU that can have memory as well, and you have to deal with the fact that you have two different pools of memory, or many pools of memory, on the same host system, right? Now, the Type 3 devices used to be called memory buffers, but you can also just think about them as memory expansion, and that's really what we're going to look at today when we go to Marvell's lab and look at the Structera line.

But before we get there, I think it's super important to just kind of conceptualize what needs to happen to be able to connect memory on a memory expansion device. A couple things need to happen. First, the host processor needs to be, uh, CXL compatible.
And not just the host processor needs to be CXL compatible: on most host processors these days, not all of the PCIe lanes are also CXL lanes. Usually only a subset also support CXL. Okay. So now that you have the host processor, you have the PCIe root that can also speak CXL, and you have wires across to another device. On the other side, you need something that takes that CXL, translates it into your memory, and really can host the memory, right? So you have processors like these. What these controllers do is they have a memory controller, so they can have, you know, either DDR4 or DDR5. Then they take that, put it onto CXL, and speak the CXL protocol to the host processor. Right? So you have that connection between the host processor and the Structera chip, and then the Structera chip is what talks to and controls the memory. And so let's get over to Marvell and take a look at what that looks like.

Now I'm here in the lab with both Structera X and Structera A, and we're going to look at some of the differences between those two. In my hand I have Structera X, which is for memory expansion, and Structera A, which is for acceleration. Now, typically in the industry we've talked about CXL Type 3 devices as being pretty simplistic: like, you know, you add these devices into a server and you add memory, and that's how a lot of folks think about them. But these actually do quite a bit more. So I thought let's start with Structera X, because that's probably closest to what a lot of folks think about when they think about simple CXL memory expansion, except we do have one major difference.

Now, here we have the two chips. This is the Structera X for DDR4, and this one over here is the Structera X for DDR5. You'll notice they're a little bit different sizes, but there's a good reason for that. The DDR4 version can handle up to four channels of DDR4 memory, but with a special trick in that it can have up to three DIMMs per channel, which gives you 12 DIMMs per controller. That's exactly what we have here. Now, the DDR5 version, of course, runs faster DDR5 memory, but on the other hand you only get up to two DIMMs per channel; of course, you're running at faster DDR5 speeds. So, when you have to fit 12 DIMMs onto a board like this, you have to have a little bit more slender of a package, and I think that's why the DDR4 version is a little bit slimmer.

Okay, so let me level with you and just show you the Structera X solution, right? Because this is one of the big assemblies that this will get deployed into, and for very good reason. This is a DDR4 Structera X. You can see that we have our 12 different DIMMs here, and these are all DDR4 DIMMs. The reason this is such a big piece is because you can put your old DIMMs in. So if you think about hyperscalers, they've been deploying DDR4 for years; there's DDR4 memory just sitting out there in servers that are being retired. The idea here is that by having 12 different DDR4 DIMMs on this board connected via CXL, you can add a lot more capacity. You're not necessarily looking for the most performance out of this, because it's still DDR4. And of course, if you really wanted to, you could have fewer DIMMs per channel and probably get faster memory speeds, but this is the real configuration because, of course, you take your DDR4 out of your old servers, you put it in here, and you get to go and add a huge capacity tier.
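To the host operating system, a CXL Type 3 expander like this typically just looks like more system RAM in a separate tier. Below is a minimal sketch (not from the video): on Linux, once the expander's capacity has been onlined as system RAM (for example via the kernel's dax/kmem path), it generally appears as an extra, usually CPU-less, NUMA node, and you can spot it from sysfs. Exact behavior varies by kernel, firmware, and platform, so treat this as illustrative.

```python
# Minimal sketch: list NUMA nodes and their memory from Linux sysfs.
# Assumption: the CXL expander has already been onlined as system RAM,
# in which case it typically shows up as an extra NUMA node that has
# memory but no CPUs. Paths are standard sysfs; exact behavior varies
# by kernel, BIOS, and platform.
from pathlib import Path

def numa_nodes():
    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpus = (node / "cpulist").read_text().strip()  # empty for CPU-less nodes
        mem_kb = 0
        for line in (node / "meminfo").read_text().splitlines():
            if "MemTotal:" in line:
                mem_kb = int(line.split()[-2])  # value reported in kB
        yield node.name, cpus, mem_kb

for name, cpus, mem_kb in numa_nodes():
    tag = " (CPU-less: candidate CXL/expansion tier)" if not cpus else ""
    print(f"{name}: cpus=[{cpus or 'none'}] mem={mem_kb / 1048576:.1f} GiB{tag}")
```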
A lot of the hyperscale DIMMs are maybe 64 to 128 GB, which makes this a really nice board just to go and get extra memory capacity that you can provision to different servers. Oh, and there's another benefit, beyond simply adding extra memory, that you've probably seen in the market before: Structera X also does inline LZ4 compression of the data. So what happens is the data goes between the host system and, say, this connector here, and then into the controller, and that's just the normal data stream you would see out of a CPU. But once it goes between the controller and the DIMM modules, it can be compressed. Now, I've been told that it's something in the range of maybe 1.8 to 2x; of course, different types of data and workloads will be different, but figure you get 1.8 to 2x compression, which means that if you have a 128 GB DIMM installed, that could potentially store up to 256 GB of compressed memory. And, oh by the way, you can then have 12 of those, which is almost like having 24 DIMMs, but you only need 12.

So that's why this is so awesome, right? You get to recycle your old memory. You don't have to buy new memory. And you can also do compression on the data in that memory, so you get even more data stored on these DIMMs. Hopefully you can understand why this is so attractive for hyperscalers, right? Because they can take things that would have gone into the recycle bin, or a home lab or something like that, and instead reuse them. They don't have to spend thousands of dollars for terabytes of memory; instead they can just use something like Structera X and get all of that benefit, albeit at a slower memory tier. And of course, that's really reusing memory for DDR4. But there's also the Structera X DDR5 version. So if you want higher-performance memory expansion, where you actually want to use DDR5, maybe even earlier DDR5 modules, and you want to put those into other servers at that higher DDR5 performance, well, that's why you'd build something like this, but with the DDR5 Structera X.
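A quick, hedged back-of-the-envelope on that compression claim. The sketch below uses the `lz4` PyPI package purely to illustrate why the quoted ~1.8 to 2x ratio is workload-dependent: LZ4 ratios swing widely with data entropy. The device does this transparently inline in hardware; nothing here is Marvell's implementation, and the 12 x 128 GB figures are just the transcript's example numbers.

```python
# Illustration only: LZ4 compression ratios depend heavily on the data.
# This uses the `lz4` PyPI package in software; the Structera hardware
# path is transparent inline compression, not this call.
import os
import lz4.frame

one_mib = 1 << 20
samples = {
    "zeros (best case)":   bytes(one_mib),
    "text-like":           (b"the quick brown fox jumps over the lazy dog. " * 24000)[:one_mib],
    "random (worst case)": os.urandom(one_mib),
}
for name, data in samples.items():
    ratio = len(data) / len(lz4.frame.compress(data))
    print(f"{name:22s} ratio ~{ratio:.2f}x")

# Back-of-the-envelope capacity, assuming 12 x 128 GB DIMMs at ~2x:
print(f"effective capacity ~{12 * 128 * 2} GB from {12 * 128} GB of DIMMs")
```

Random data illustrates the floor (about 1x), and highly repetitive data the ceiling; real server workloads land somewhere in between, which is consistent with the 1.8 to 2x figure being quoted as a range rather than a guarantee.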
But what happens when you want to do more? Because in a lot of applications, you don't just want more memory bandwidth; you need more compute to go along with that bandwidth. Well, that is exactly what this next one is for. This is the Structera A package, which you'll see is very similar to the DDR5 Structera X package, but instead of just having a memory controller with that inline compression and all that, this has 16 Arm Neoverse V2 cores. That means we have real server CPU cores built into the CXL controller. Okay, now I know you're thinking, like, why do you even care about putting Arm cores on your CXL memory controller? Like, what the heck is that about? Well, think about it this way. If you're really doing an application where you need to do some data processing, whether it's an AI application, data warehousing, or some other type of application where you just need to process a lot of data that's in memory, then if you're just adding more memory bandwidth and you're not adding compute, you're really only solving part of the equation to getting more performance. And so this card in front of me is a Structera A card, and let me show you why this is so exciting, right?

This board is a low-profile PCIe card, and you're going to see it has a giant and pretty heavy, uh, I'll tell you, copper heat sink on it. But the idea is that on both sides of this board you're going to see that we have DDR5 packages, and in addition you have your 16 Arm Neoverse V2 cores that are also on this card. So think about the impact of this on an application like an AI application, a machine learning application, a vector database: all these types of things that need not just memory capacity but also compute and memory bandwidth. Instead of having to take data from these chips that are sitting here and bring it all the way back over the CXL bus to the host processor and process it over there, you're processing it all on the same card, and you're scaling your compute along with your memory and memory bandwidth. So as I put this card in, I get the memory bandwidth because I have my local controller. Then I also have my 16 Arm Neoverse V2 cores, which give us more compute, and I also get features like that compression. I get all of those really cool things, all on a card like this. And let me just show you what that practically looks like if you want to scale both your compute and memory.

Now here we have a bunch of different cards, and these cards are pulled out of the server so that you can actually see them, so they're not just all sitting there. You can see that we've now added four of these cards. Now, if you think of these, maybe these are 128 gigs each, but of course they can be a terabyte, 4 terabytes, you name it. And they also have 16 cores each. So this is 64 Arm Neoverse V2 cores with up to 16 terabytes of memory, all just on cards like this.

Okay, so you're probably wondering exactly what I was wondering: how does this even work? I mean, couldn't you take Structera A, which has Arm cores, and put it into an x86 server? And the answer is yes. So this behind me is Meta's FAISS running a vector search engine, really doing graph navigation. This is the kind of thing you'd do if you're doing, like, um, you know, recommendation engines, or RAG, those types of applications. And one of the challenges with that is that you're often memory capacity and memory bandwidth bound, which means that adding an accelerator with CPU cores fixes both of our problems, because remember, we can now do the computation locally without having to bring all of the data back to the host processor. We can do that locally on the CXL device, and we don't have to go do that traversal. That also means that we're able to scale our compute by just adding more cards, and that's exactly what we're showing here.

And so on the left here, you're going to see that we are running at just around 27,800-ish queries per second, and you can see our latency is somewhere in that 360 millisecond range. We are also using all 48 of those Xeon 6 cores, running somewhere in that 60 to 70% utilization range. But here's the challenge: if we want that server to do more, we're kind of stuck, right? We just don't have a whole heck of a lot more CPU resources. We're stuck in terms of memory. So what we might want to do is add more memory, memory bandwidth, and compute. And that's exactly what we're doing on the right-hand side over here with Structera A. And if you want to see the system that we're running on, this is what it looks like here, running in one of the labs.
It's just a little bit louder in here, so instead we're just going to show you what it looks like. But this is the system this is actually running on. And so what you'll see here is that setup running something a little different, because we have the three different Structera A cards that you saw, plus we have our Xeon 6 cores. Now, the Xeon 6 cores, we're not going to run any of this on those, so you'll see these are all at 0% utilization, or somewhere pretty darn close to 0%. Instead, what we're going to do is cycle through the number of Structera A devices we have because, of course, we have CXL and PCIe slots, so we can just keep adding these devices.

And so when we have one, two, or three of those, you're going to notice our queries per second scale linearly. We had, say, 11,500 to maybe a little over 12,000 queries per second with one card. When we start running the workload on a second card, we get into that maybe 23,000 to 24,000 queries per second range. And when we run that third card, we run in the 34,000 to maybe 36,000 queries per second range. And you'll notice our latency has also gone down; we're only at 280 to 290 milliseconds. So we're lower than running on the Xeon 6 cores only. And the cool thing here is that we're doing that without even touching the Xeon 6 cores on the host server at all.

And I want to make sure that everybody understands the gravity of what we're showing here, right? Because we're showing that you're actually getting lower latency on this workload even though you're running a CXL device. Now, a lot of folks will make the correct observation that when you go over CXL, there is a latency hit compared to hitting local memory on a host CPU. I mean, that's always the case. But what you'll see here is that because we have the CPU on the accelerator card connected to the memory, we now have local access to run the queries, and therefore we're getting lower latency on our workload, because we're accessing memory that way rather than bringing the data all the way back over the CXL bus to the host processor. And so when people say that CXL always adds latency, this is a good example of why that's not necessarily the case, especially if you push your compute out to the CXL device itself.
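As a sanity check on the scaling claim, here is the arithmetic on the approximate queries-per-second figures quoted above. The numbers vary run to run, so the values below are rough midpoints taken from the transcript, not independent measurements.

```python
# Sanity check on the demo's scaling claim, using approximate midpoints
# of the per-card queries-per-second ranges quoted in the video.
qps = {1: 11_750, 2: 23_500, 3: 35_000}  # cards -> queries/second (approx.)
base = qps[1]
for cards, q in qps.items():
    # Compare measured throughput against perfect linear scaling from 1 card.
    print(f"{cards} card(s): ~{q:,} QPS, {q / (cards * base):.0%} of ideal linear scaling")
```

On these midpoints, three cards land at roughly 99% of perfect linear scaling, which is what you would expect if each card runs its shard of the index against its own local memory with essentially no shared bottleneck on the host.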
Okay, so let's take a look at Structera X. And this, again, is without the Arm cores; this is a CXL memory controller with compression, and we have that compression turned on here. Let me just show you why this matters. Let's say you have an AI cluster, right? Your GPUs only have so much GPU memory, your host memory is only so big, and so if you want to have things like larger KV caches, those can actually give you a much larger impact than the latency impact of having a CXL memory device in your system. Let me show you what's going on here.

So behind me we have two different demos, and really the difference is the size of the KV cache, because we are adding a Structera CXL memory expansion device. That means we have more memory capacity on the right side than we have on the left side. What we're doing is we're using a Llama 3.1 8-billion-parameter model, and we're saying, hey, model (or chat session), go read this novel, then give me a summary, and then let me ask a couple questions about it. You can see that that is a very useful and very used application for LLMs. And what you'll see behind me is the speed of that LLM when we're using that CXL device. Because we're able to have that larger KV cache, you'll see that our time to first token here is only about half a second, whereas if we're only running from GPU memory, because we run out of memory, we're now at 1.9 seconds. And I will just say that this does vary a little bit based on the run and all that kind of stuff, but it gives you an idea that maybe you're 3x or more faster by having the larger KV caches using CXL memory, versus being stuck with smaller KV caches and a fixed GPU memory size.

Now guys, look, KV caches in the industry are a hot topic right now, because folks literally see these types of improvements when they're able to go and optimize their KV cache operations. And this is a really good example where folks have said for a long time that CXL is only for general-purpose compute, but this is a really good little demo showing how you can use CXL in a GPU AI server and actually get these better results. And by the way, this again is Structera X, not Structera A, so we are not doing the compute on the CXL device; this is only adding memory capacity.
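To see why the capacity tier matters for this demo, it helps to size the KV cache. Below is a rough worked example, assuming the publicly documented Llama 3.1 8B shape (32 layers, grouped-query attention with 8 KV heads, head dimension 128) and an fp16 cache. Real serving stacks may quantize or page the cache, so treat this as an upper-bound estimate, not the demo's actual configuration.

```python
# Rough KV-cache sizing for the long-context demo. The model shape is
# the publicly documented Llama 3.1 8B config; the fp16 cache and the
# context lengths below are assumptions for illustration.
n_layers, n_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2  # fp16

per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # K and V
print(f"KV cache per token: {per_token / 1024:.0f} KiB")        # ~128 KiB

for ctx in (8_192, 32_768, 128_000):  # context lengths in tokens
    gib = per_token * ctx / 2**30
    print(f"{ctx:>7,} tokens -> ~{gib:.1f} GiB of KV cache")
```

At roughly 128 KiB per token, a novel-length context runs into double-digit gigabytes of cache on top of the model weights, which is exactly the kind of capacity that spills past a fixed GPU memory budget and benefits from an expansion tier.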
Okay, now in all of our videos I like to have key lessons learned. I mean, what did we learn by doing this? I think there are a couple things, right? First off, that DDR4 use case seemed a little bit weird to folks at first: the idea that they have zettabytes of DDR4 memory just sitting there, and they're going to eventually take those servers out, retire them, send them to recyclers, and, you know, who knows what happens to that DDR4 memory after that. And the idea was, well, we can actually just pull that memory. It's not that expensive to go pull it. We can put it into these memory controllers, and then, number one, we get the benefits because we're not buying more memory. It might be slower. Fine. Instead, they can reuse that DDR4 memory that's already been manufactured, which means your carbon emissions and all that kind of stuff are lower, because you're now recycling, or upcycling, or whatever you want to call it; you're reusing that memory. So from a green-initiative perspective, that's awesome. Also from a cost perspective, that's awesome. And so now most of the hyperscalers that I've talked to have programs in place where they are looking at, and actually doing, this: figuring out how to put DDR4 memory into servers, or at least pull it.

One of the other things to remember is that as we move into newer generations, starting with CXL 2.0 and then, of course, CXL 3.0, it gets a lot more exciting. The idea is that eventually we're going to have CXL switches. So instead of just saying, okay, this is some DDR4 memory, I'm going to attach it to one system, you're going to be able to create a shelf, load up that shelf with DDR4 memory, and then have a switch, connect multiple systems to that switch, and now you can share all of that memory among different systems. Now, it may not be the fastest in the world, but who cares, when it's essentially free or super low cost?

Now, the other thing I think is really neat is that we looked at the use case where you add both those Arm Neoverse V2 cores along with the memory. So instead of just scaling the amount of memory you have access to in a system, you're also scaling the amount of compute. I had no idea myself exactly what that would look like in a system, but the idea is that you can just send commands, and those commands tend to be pretty small. Instead of having to bring a giant payload of data back across the CXL bus into a host processor, have the processing done there, and have the outputs brought all the way back across the bus, you can just do that on the device itself. Well, that's the idea of near-memory or in-memory computing and that whole field, right? But instead of having to have, like, different DRAM dies and all that, you can literally just use your standard DRAM, have the controller do it, and it makes life way easier. So while it is correct to say that, you know, going over CXL may not be as fast as accessing local memory, if you can actually do your processing out on that CXL device, well, now you're not bringing all that data back, and you can actually get more performance, because you're increasing the total amount of compute and the total amount of memory bandwidth that's available to that compute by adding those Arm Neoverse V2 cores next to the DDR5 memory out on one of the Structera chips.

Guys, I know this is a little bit dense, but how cool is it to get to see all of the stuff that normally sits in labs at hyperscalers, chip companies, all that kind of stuff, and actually get to go out and see it? I love doing these types of videos. And hey, if you did like this video and you want to share it with some of your friends and colleagues because you think it's cool, well, definitely go and share this video. Let me know what you think down in the comments as well. And if you did like this video, by the way, don't forget to give it a like, click subscribe, and turn on those notifications so you can see whenever we come out with great new videos. As always, thanks for watching.

Video description

We get a behind-the-scenes look at the hyperscale technology recycling DDR4 into DDR5 servers using CXL. We even get demos of the Arm Neoverse V2 cores that you can add along with DDR5 into servers with the Marvell Structera A line.

STH Main Site Article: https://www.servethehome.com/hyper-scalers-are-using-cxl-to-lower-the-impact-of-ddr5-supply-constraints-marvell-arm/
Substack: https://axautikgroupllc.substack.com/
STH Top 5 Weekly Newsletter: https://eepurl.com/dryM09

Become a STH YT Member and Support Us
Join STH YouTube membership to support the channel: https://www.youtube.com/channel/UCv6J_jJa8GJqFwQNgNrMuww/join
Professional Users Substack: https://axautikgroupllc.substack.com/

Where to Find The Unit We Purchased
Note we may earn a small commission if you use these links to purchase a product through them.
STH Merch on Spring: https://the-sth-merch-shop.myteespring.co/

Where to Find STH
STH Forums: https://forums.servethehome.com
Follow on Twitter: https://twitter.com/ServeTheHome
Follow on LinkedIn: https://www.linkedin.com/company/servethehome-com/
Follow on Facebook: https://www.facebook.com/ServeTheHome/
Follow on Instagram: https://www.instagram.com/servethehome/

Other STH Content Mentioned in this Video
CXL the Taco Primer: https://www.youtube.com/watch?v=Mp9L7OClb2U
Marvell Structera: https://www.servethehome.com/this-cxl-memory-controller-has-16-arm-cores-marvell-structera-a/
Marvell Structera at FMS: https://www.servethehome.com/marvell-structera-a-and-x-cxl-expansion-displayed-at-fms-2024-arm/
First Marvell Structera Demos: https://www.servethehome.com/this-cxl-memory-controller-has-16-arm-cores-marvell-structera-a/marvell-structera-x-and-a-at-marvell-analyst-day-2024/
CXL Announcement 2019: https://www.servethehome.com/intel-cxl-compute-express-link-interconnect-announced/

Timestamps
00:00 Introduction
01:00 What is CXL
04:13 Checking out Marvell Structera X and Structera A in the Lab
11:27 Some say CXL is slow, but memory capacity matters
17:29 Key Lessons Learned

© 2026 GrayBeam Technology · v0.1.0 · ac93850 · 2026-04-03 22:43 UTC