bouncer

Craft Computing · 41.3K views · 1.5K likes

Analysis Summary

30% Minimal Influence
mild · moderate · severe

“Be aware that the 'affordability' of the $6,000 server is framed against the extreme cost of brand-new enterprise gear, which may still make it an impractical 'deal' for most hobbyists.”

Ask yourself: “Did I notice what this video wanted from me, and did I decide freely to say yes?”

Transparency: Transparent
Human Detected: 95%

Signals

The content features a well-known human creator (Mikey from Craft Computing) with a distinct personal brand, including specific recurring segments and a conversational, opinionated tone that lacks the formulaic structure of AI-generated scripts.

Personal Voice and Anecdotes: The narrator uses first-person pronouns ('my fleet', 'I'm Mikey'), mentions specific personal goals (opening the server to his Patreon), and includes a recurring personal segment ('What am I drinking???').
Natural Speech Patterns: The transcript includes conversational filler and informal phrasing like 'AI bros', 'little experiment', and 'barely moves the needle'.
Niche Domain Expertise: Detailed technical comparison between SXM2 form factors, HBM2 memory bandwidth, and specific server models (Inspur DGX V100), delivered with a consistent host persona.

Worth Noting

Positive elements

  • This video provides highly specific benchmark data comparing legacy HBM2-based enterprise cards against modern GDDR6X consumer cards in LLM inference tasks.

Be Aware

Cautionary elements

  • The video is a 'triple-threat' of affiliate marketing: it promotes a cloud service, a hardware vendor, and its own Patreon/merch simultaneously.

Influence Dimensions

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed March 13, 2026 at 16:07 UTC · Model: google/gemini-3-flash-preview-20251217
Transcript

So, it's been a few months and it's time for an update on the latest server in my fleet, the Inspur DGX V100. Over the last couple weeks, I've been running some tests to see how it performs when running AI models. While the ultimate goal is to open up this server to my Patreon, allowing users to access various models in the cloud, I need to see how everything performs first, along with figuring out how to grant access to the server once it's up and running. Today though, I wanted to go over some of my early benchmarks, talk about why a massive 8-GPU server like this, despite its age, might still be relevant today, and go over some benchmarks against some top consumer GPUs. And I'm sure all of the AI bros will tell me that I did everything in this video wrong. So, what do you say we learn something together? Welcome back to Craft Computing, everyone. As always, I'm Mikey.

Before we get started, once again, a huge thanks to UnixSurplus for sending over the Inspur DGX V100 for this little experiment. Links on where to buy the same server will be down in the video description if you're interested in picking this one up for yourself. A quick recap: the system is running eight Nvidia Tesla V100 GPUs in the SXM2 form factor. Each card contains 5,120 CUDA cores based on Nvidia's Volta architecture, a design that never really made its way down to consumers, sitting between their Pascal and Turing lineups. Each card also has 32 GB of HBM2 memory on board with a 4,096-bit bus. In total, we're looking at roughly 900 GB per second in memory bandwidth per card and a total of 256 GB of video memory in the server. In the world of modern AI processors, that might not sound like much, what with the Nvidia B300 boasting 288 GB per GPU and up to 2.3 terabytes per system. But at just over $6,000, the Inspur DGX V100 might still be one of the most affordable ways to run some of the latest high-demand LLMs.

But even $6,000 is a lot of money, especially if you're simply looking at evaluating whether or not an AI workflow might be beneficial to you or your business. Instead of investing in pricey hardware, you should take a look at the sponsor of today's video, Verda, formerly DataCrunch. If you've been struggling to spin up AI models on your current hardware or are hesitant to invest tens of thousands of dollars, Verda might be for you. They offer instant access to high-performance NVIDIA GPUs, optimized clusters for training large AI models, and serverless inference with autoscaling, all without waiting days for provisioning. They offer GPU solutions to fit any budget, from the V100s, like I have here, all the way up to Nvidia's latest B300 GPUs, ready to be put to use in your most demanding workloads. What's more, Verda data centers run entirely on renewable energy, so your AI workloads don't just run fast, they also run green. Plus, with all Verda servers residing in the EU, they're protected under GDPR and some of the world's strictest data privacy laws. You get secure ingress and egress of all of your data with no hidden compliance headaches. Whether you're ready to move your AI workflows to the cloud or just need to evaluate which solution is right for you, Verda is the cloud provider that is fast, affordable, private, and sustainable. Check out what Verda has to offer by going to verda.com or clicking the link down in the video description. Plus, if you use coupon code CRAFT-COMPUTING-V100, you'll get $25 off at checkout. Again, that's verda.com.
And thanks again to Verda for sponsoring today's episode. Starting off, what exactly are we testing here today? Obviously, the DGX V100 was built for machine learning and AI workloads, but it's also pushing 8 years old. As I mentioned, while the sheer amount of memory still dwarfs modern consumer GPUs with up to 256 gigabytes combined, it barely moves the needle against recent data center chips. It's no surprise that hyperscalers are starting to move on from V100s. But for smaller organizations, or even extreme home labbers like myself, it's still among the cheapest ways to get that much video memory into a single system. Now, on top of the eight Tesla V100 cards, the Inspur DGX is also loaded when it comes to traditional compute, with a pair of Intel Cascade Lake Xeon 8260 CPUs, each of them with 24 cores and 48 threads, giving us a total of 96 threads to work with. We've also got 512 GB of DDR4 ECC memory, meaning even if AI models can't fit entirely into our massive VRAM cache on the V100s, we still, in theory, have plenty to offload for larger models. I say in theory because my goal here is to run AI models entirely in VRAM, as that's how we'll see max performance. Plus, as I'm testing consumer GPUs on a different system entirely, I don't want to conflate my results by running outside of the GPUs themselves.

In the first video with this server, I installed a full set of eight 1.92 TB Patriot Burst SSDs, assuming that with a two-drive failover in a RAIDZ2, these inexpensive SSDs would work just fine. Well, a week into testing and one of those drives failed. Another 3 days later and I had a second drive fail. Luckily, UnixSurplus again had my back and shipped out eight Samsung 1.92 TB enterprise SSDs to get us back up and running. For testing the V100, I've got Proxmox installed, which will allow me to create VMs and pass through V100s in whatever quantity I need. This will allow me to quickly test one GPU, two, or even all eight at the same time, without complicated configurations or reinstalling the OS onto bare metal. Keep in mind the server is up at 01's YOLO Colo, so everything that I'm doing with the server has to be done remotely.

To get things started, I ran MLPerf, a standardized benchmark suite developed by MLCommons for testing AI and machine learning performance regardless of the platform. And while this test is designed to run on any hardware that you have, whether CPU or GPU, it also seems to benchmark only a single GPU rather than being able to scale to multiple cards. As such, we're going to be able to get numbers for a single V100, but not see how well they scale with multiple cards running a single model. But the numbers are still interesting, especially considering that the Tesla V100 is 8 years old at this point. In Llama 3.1, we see an average time to first token of 0.33 seconds and 84.6 tokens per second. Comparing those results to the current top-end consumer card, the RTX 5090, it's no surprise to see the 5090 way out in front, running a full two and a half times faster at 211 tokens per second. But that's also not the whole story. While the 5090 was faster, it also peaked at 475 watts during testing. Meanwhile, a solo V100 peaked at just 175 watts, making it actually slightly more efficient per token than the 5090 was. Moving on, the RTX 5080 is almost exactly half the performance of a 5090 in gaming, but it managed to come in well above the halfway mark at 143 tokens per second, while also well outpacing the V100 for both speed and efficiency.
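To put that efficiency claim in concrete terms, here is a minimal sketch (Python) using only the throughput and peak-wattage figures stated above. Peak draw is used as a rough proxy for sustained power, so treat the result as an approximation rather than a measured efficiency number.

```python
# Rough efficiency-per-token comparison using only the figures quoted in the video:
# Llama 3.1 8B throughput (tokens/s) and peak power draw (watts) during the MLPerf run.
results = {
    "Tesla V100 (single)": {"tokens_per_s": 84.6, "peak_watts": 175},
    "RTX 5090":            {"tokens_per_s": 211.0, "peak_watts": 475},
}

for gpu, r in results.items():
    tokens_per_watt_second = r["tokens_per_s"] / r["peak_watts"]
    print(f"{gpu}: {tokens_per_watt_second:.3f} tokens per watt-second")

# Output: V100 ~0.483 vs. 5090 ~0.444 tokens per watt-second at peak draw,
# which is why the older card comes out slightly ahead on efficiency per token.
```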
Now, one thing that I wanted to figure out was what modern GPU the V100 is equivalent to when it comes to AI testing. So, I did quite a few more tests and finally dropped in the RTX 4070 Super and got a result of 88.6 tokens per second. And it looks like we might have a winner. The V100 trades blows with the RTX 4070 Super, staying within about 5% of it in every metric that I tested. But there's also one big caveat with the 4070 Super, as it only has 12 GB of VRAM compared to 32 GB in the V100. Obviously, the 4070 Super is less expensive, but you're also going to be hard-pressed to get eight of them into a single server, let alone into a 2U space like the Inspur DGX system that we have here. With the stock air cooler, you could possibly fit four of these cards into a 4U server, or upwards of seven of these GPUs if you went with a full custom water-cooled solution and cut them down to a single slot each. But that's still less than half as dense when it comes to GPU compute per U, and barely one-third the VRAM capacity of the V100s.

Looking at the Phi 4 reasoning benchmark, the results are nearly identical. The V100 comes in with 59.7 tokens per second, with the 5090 clocking 138, or roughly 2.3 times faster. The 5080 also comes in almost exactly between the two. Again, in this test though, the V100 comes in well ahead of the 4070 Super at roughly 15% faster. As I'm sure a lot of you will be quick to point out, a 14-billion-parameter model isn't exactly stressing the VRAM on any of these GPUs. Unfortunately, that was the largest model available in MLPerf, but it still gave us a good idea of what kind of speed we'll be looking at when it comes to single-GPU instances. For smaller models, there's no reason I couldn't run eight VMs on the DGX V100 with eight different models installed. But what about a larger model that requires a ton of VRAM?

The easiest way I know of to run large models on your own hardware is through a program called LM Studio. It allows for downloading massive LLMs and running them offline, locally, on your own system. There are advantages and disadvantages to using LM Studio. On the upside, it's multiplatform, so you can run the same models whether you have Windows, Linux, or even macOS. And it works with just about every hardware type as well. Whether you have Nvidia, AMD, Intel, or even Apple, you give it a GPU or CPU with enough memory, and it'll run the model for you. The downside, though, is it is 100% local to your PC and your local desktop GUI. There's no web UI to access the models remotely, so you can run it as a local user on your desktop, but not from another PC. That makes it great for testing which models can run on your hardware, but not great for me wanting to share access to these models with multiple people.

The models we tested in MLPerf were 8 billion and 14 billion parameters respectively, and those can both pretty easily run inside of a 12 GB GPU with little to no spillover into system memory. So what happens when we jump to a slightly more demanding model like Meta Llama 3.3 with 70 billion parameters? A 70-billion-parameter model, according to this very AI, can require up to 280 GB of memory to store model weights. However, in my testing, that number was closer to 48 GB running four Tesla V100s, with each GPU allocating roughly 12 GB of VRAM to store the model. And the system did register around 52 GB of system memory as in use while the model was loaded, but LM Studio was only using around 3.5 GB of that.
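For context on why the 70B model fits in roughly 48 GB rather than the 280 GB quoted for raw weights, here is a minimal back-of-the-envelope sketch. The quantization level is an assumption on my part (the video doesn't say which build LM Studio downloaded): 280 GB corresponds to FP32 weights (70 billion parameters × 4 bytes), while a ~4-5 bit quantized build lands near what was observed.

```python
# Back-of-the-envelope VRAM estimate for Llama 3.3 70B.
# ASSUMPTION: LM Studio was running a ~4-5 bit quantized build (not stated in the video).
params = 70e9

fp32_gb = params * 4 / 1e9            # ~280 GB, matching the full-precision figure quoted
quant_bytes_per_param = 0.55          # ~4.4 bits/param, hypothetical quantization level
overhead = 1.2                        # rough allowance for runtime buffers and KV cache

quant_gb = params * quant_bytes_per_param / 1e9 * overhead   # ~46 GB total
per_gpu_gb = quant_gb / 4                                    # ~12 GB across four V100s

print(f"FP32 weights: ~{fp32_gb:.0f} GB")
print(f"Quantized estimate: ~{quant_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU")
```

Under those assumptions the estimate lands close to the ~48 GB (~12 GB per card) utilization reported in the video.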
Now, I fully admit I'm not throwing the most intense prompts at this model, but four GPUs seem to be more than enough for it to run quite well. On average, we're seeing around 14 and a half tokens per second. The first question asked when the model was loaded did take around six seconds to generate the first token. After that, subsequent questions were much faster at around a third of a second. When I threw all eight GPUs at the model, those numbers didn't change at all, still sitting right around 14.5 tokens per second with a time to first token of around 0.3 seconds. This lines up with some of the advice I've given before around running AI models. Most of the time, if you can fit a model in memory, it'll run on whatever GPU you give it, and more compute doesn't necessarily mean faster or better results. In the case of Llama 3.3, we're easily able to hold the entire model in GPU memory, and the added compute doesn't help it process any faster. In fact, even dropping down to just two V100s shows little to no impact on performance, with even slightly better results of 15.5 tokens per second and a similar time to first token of 0.25 seconds. Given we're still seeing 48 GB of VRAM utilization, I would still need at least two V100s to run this model. But that also means I could easily run four 70-billion-parameter models on this server at essentially full speed.

But what if we wanted to go even larger? Jumping up to a 120-billion-parameter model doesn't necessarily mean we'll be giving up performance. Llama 3.3 is what's known as a fully dense model, meaning that every parameter of the model is used for every calculation. We can actually move to a larger model, say OpenAI's GPT-OSS with 120 billion parameters, and actually see improved performance despite its much larger size. GPT-OSS is what's known as an MoE, or mixture-of-experts, model. Essentially, it's a more efficient transformer model and can lead to around 70% more flops per token despite the larger database. The larger data set does mean significantly increased memory requirements, with around 60 GB utilized by the GPUs. However, it should be noted that GPT-OSS refused to load when I had just two of my V100 32 GB cards installed. I had to move to four GPUs for the model to load without errors. And even then, I ran into an issue when it came to inference.

See, each token generated by an LLM still takes up space when interacting, and those tokens need to be queried if your conversation continues beyond one or two simple prompts. The default context size in LM Studio is 4,096 tokens, which fills up rather quickly on such a large model. I didn't even make it through my second question around hardware requirements before the context overflowed and the model simply stopped responding. Every response from an LLM generates tokens, and the longer your conversation is, the more tokens need to be generated and stored. And the larger your model, the more tokens are required to search the massive parameter database and generate responses. So it's not just model size that you need to take into account; it's also the size of the responses that you generate. Each model that you install will have a max context size and needs to be configured to meet your needs. Each model has its own practical, theoretical, and breaking-point limits when it comes to context length as well. Tokens require space. As an FP16 workload, one token is around 256 kilobytes in size. That means 100,000 tokens is 25 GB worth of context, and that needs to be stored on top of your model memory requirements.
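To make the context-size math concrete, here is a minimal sketch using the video's own rule of thumb of roughly 256 KB per token at FP16. Actual KV-cache size varies by model architecture, so treat this as the video's approximation rather than a general formula.

```python
# Context (KV cache) memory using the video's rule of thumb: ~256 KB per token at FP16.
BYTES_PER_TOKEN = 256 * 1024

def context_memory_gb(tokens: int) -> float:
    return tokens * BYTES_PER_TOKEN / 1024**3

for tokens in (4_096, 32_000, 100_000, 130_000):
    print(f"{tokens:>7} tokens -> {context_memory_gb(tokens):5.1f} GB")

# 4,096 tokens (LM Studio's default) is about 1 GB, and 100,000 tokens is roughly 24-25 GB,
# matching the "100,000 tokens is 25 GB" figure above; all of it sits on top of the model weights.
```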
In theory, GPT-OSS with 120 billion parameters can support up to around 130,000 tokens, but anything beyond around 32,000 can start to cause stability or, worse, hallucination issues. But all that said, I was able to get GPT-OSS 120B up and running on four V100s and saw some rather impressive performance, upwards of 60 tokens per second, which is actually right in line with what the AI model told me I should be seeing on this exact hardware. So, running large models is definitely still very possible on this hardware, and likely much more affordable than buying four RTX 5090s or a pair of RTX 6000 Pro Blackwell GPUs to make it happen. So, some pretty interesting results, and definitely encouraging knowing that $6,000 can still run some pretty impressive models, even given the age of the hardware here. But I did run into a couple funny scenarios with these offline models.

One of the main benefits of LM Studio is the fact that the models run entirely offline and cannot access the internet. This means your data stays private and nothing you send into it will leave your server. However, that also means that it won't seek out new data or confirm any of its output by searching the internet. Let's take me as an example. I'm Jeff with Craft Computing. I say it at the beginning of every video I've ever published, but I'm also not the most popular channel in the world. And outside of my niche of home labbers and server enthusiasts, you're unlikely to actually know who I am, which makes me the perfect test candidate for seeing whether or not an AI is bullshitting you with its output. GPT-OSS's training data is current up to September 2023, which means it should, in theory, have seven-plus years of my channel to look back on, if that data is in fact in its database. Asking "Are you familiar with the YouTube channel Craft Computing?" gets an enthusiastic yes: "Craft Computing is run by Mike, aka Mikey, a former software engineer turned full-time tech educator." Um, last time I checked, I'm Jeff. In fact, I've always been Jeff. So, it got that wrong. But what else did it miss on? According to AI, my channel was founded in 2020 with simple how-to videos around Raspberry Pis and Arduinos. I have a Discord community of around 12,000 people. I host monthly hackathon events along with a "challenge corner" series where viewers submit project ideas and I build them live on stream. Obviously, none of that is accurate.

So, I decided to challenge the response and ask it to cite sources just for my name. "I'm sorry about that mistake. The channel's primary presenter isn't named Mike. The correct information, with publicly available sources: the Craft Computing YouTube channel is run by James Miller, who appears in every video introducing himself as James or Jay, and he has a trademark intro, starting every video by saying, 'Hey everyone, I'm James and welcome to Craft Computing.'" That's closer, I guess. But the problem is, 80% of the way there only gets us to James, not to Jeff. But things actually get worse from there, because it's not just the fact that AI made up everything about Craft Computing. When challenged, it cited sources directly to my relevant social media pages, including my YouTube about page. It made up a Wayback Machine snapshot to establish my first video published. It made up Social Blade listings for my channel size and origin, an inaccurate Discord link, and a link to my Patreon where it again claims that the creator's name is James Miller. But the links that it provided did land on my social media pages.
So, I wanted to go just a little bit deeper, asking what some of my more popular videos were. And I got a list of 10 videos that not only have I never produced, but searching for these titles on YouTube yielded zero results. I don't know if I'm more offended that AI actually doesn't know who I am, or that the viewership numbers it's claiming are tenfold what my channel actually does. I can take solace in the fact, though, that even though AI thinks I produce content around Raspberry Pis, at least this time it didn't confuse me with Jeff Geerling. That only happened when I was writing this script inside of Google Drive. I decided to repeat this test a couple different times, in a new chat window with fresh context every single time. And each time, AI made up something new about my channel, who I was, and what videos I've published. Everything from retro computing and Commodore 64s, to being a software developer, to Raspberry Pis and Arduinos, but never home labs, open source, enthusiast gaming, beer reviewing, or anything that I actually do. But there was one anomaly that was repeatable that I can't fully explain. Whenever I prompted AI to tell me how up-to-date its training data was, it always responded with September 2023. When I started a conversation with that specific question and then asked about Craft Computing, I received an honest answer of "I don't know."

And that right there is the inherent problem with generalized AI and LLMs. AI has been trained to give human-readable responses to questions, not to be able to fact-check or to reason. Even though this model is technically a reasoning model and should be able to fact-check itself, I'm sure you've seen the same thing on most online models as well, where, when questioned about the factuality of a statement, AI will instantly and simply agree with you rather than defer to its own data set or even look up sources to confirm. As long as the output approximates what a user wants to see, the LLM has achieved its goal. Now, that's all to say that AI still has a ton of use cases, and this video is not designed to argue one side or the other of any viewpoint. LLMs are something that exists, and people might want to run them on their own hardware. I wanted to demo the models running as well as demonstrate the pros and cons of running them. It's up to you, the user, to decide what to do with them and how to use them. I'm not necessarily anti-AI, as there are tons of areas that benefit from its use, from cancer research and diagnostic imaging, to real-time machine vision, to logistics and safety in industry. In fact, I recently did a video with Supermicro talking about some such use cases. And while text LLMs also have their use cases, in my opinion, they shouldn't be used for research or fact-checking. The output they're designed to give is trained on "yes, I have an answer to that question" rather than "let me find the answer to that question." Using an LLM for non-research tasks, though, can be extremely powerful, from checking the tone of an email that you want to send, to helping reword or summarize what you've been reading. I don't mind the use of LLMs as writing aids, but the facts and research should still be entirely from humans. All that being said, while this video is 100% about LLMs, not a single word in this script was written by or assisted by an AI. Nope, you can blame me, James Miller, for 100% of the content here. Back to the hardware itself, I'm actually really impressed with how well the Inspur DGX V100 performed overall.
While it is absolutely out of reach of most users, for small businesses or the 1% of home lab enthusiasts out there who are wanting to experiment with AI models, it's still likely one of the most affordable self-hosted options out there. With 256 GB of VRAM and 256 GB of DDR4 ECC, with the ability to load it up to 1.5 TB, the possibilities are pretty high with what models you can run on a system like this. Again, huge shout-out to UnixSurplus for sending out the Inspur DGX V100 system for this project. Links on where to pick one up for yourself, as well as to see all the rest of the hardware they have in stock, will be available down in the video description. If you have any other AI use cases you'd like to see me run on the server, let me know down in the comments below. I'm also looking for recommendations for granting access to the system to my Patreon backers, either through authenticated web access or maybe even a Discord bot integration. So, if you have a suggestion for that, feel free to drop a comment down below. Speaking of the Discord, the only way to get access to it is by joining the Patreon. Rather than 12,000 random internet trolls like AI claimed earlier, we have about 800 active monthly users in a very close-knit community, with the Patreon serving as a moat to keep the trolls at bay. If you want to chat directly with me, join the Patreon by following the link down in the video description. And that's going to do it for me in this one. Thank you all so much for watching, and as always, I'm Jeff, and I'll see you in the next video. Cheers, guys.

Beer for today is from Deschutes Brewing. It is a non-alcoholic Fresh Squeezed IPA clocking in at 0.5% or less. I've always loved the regular Fresh Squeezed, and I'm kind of curious to see how the non-alcoholic version stacks up. Looks like a beer so far. So, I will say, for a non-alcoholic IPA, this is pretty solid. But as I'm drinking this, I can't help but miss Athletic Brewing just a little bit. I'm a huge fan of Athletic Brewing. They make a series of dynamite non-alcoholic beers: hazies, IPAs, brown ales, golden ales, all of them are pretty fantastic. This one from Deschutes, their Fresh Squeezed NA, it's good, but it's definitely a step down from the standard Fresh Squeezed IPA. And unlike the Athletic brews, there's a little bit of a tell that this one is non-alcoholic. Anyone who's had one of the cheaper non-alcoholic beers, I think O'Doul's made some for a while, there's this rice-like flavor that sits in the back of your throat and just kind of accumulates. It's not terrible with this beer, but it is present, and it definitely takes away from every bit of flavor that this beer does have. It's really good right up front. It's got a little bit of that Strata and Galaxy hop kind of bite to the front of it, which is really pleasant. But then, as soon as I swallow, the only thing I'm left with is that slightly burnt rice-husk kind of flavor in the back, and it really takes away from the experience. Overall, not bad, but definitely could be better.

Video description

Thanks to Verda, formerly Datacrunch.io, for sponsoring today’s episode. Check them out at https://verda.com, and use Coupon Code CRAFT-COMPUTING-V100 for $25 off at checkout! https://docs.verda.com/resources/obtaining-free-credits/how-to-redeem-credits

Grab yourself a Pint Glass or Coffee Tumbler at https://craftcomputing.store

Implementing any form of AI workflow into your business is prohibitively expensive. From just the cost of hardware to the power and cooling infrastructure of modern AI servers, it's enough to turn anyone off. But you know me... I love digging up old servers out of eWaste piles and giving them new life. But how well do Nvidia's Tesla V100s stack up to modern cards in AI? Today, we're testing eight Tesla V100s in 70B and 120B LLMs to see if there's life still in these eight-year-old GPUs.

But first... What am I drinking???
Deschutes Brewing (Bend, OR) Fresh Squeezed IPA NA (0.5%)

HUGE THANKS to UnixSurplus for sending over the Inspur DGX V100 system for me to take a look at. Check them out at https://UnixSurplus.com or their eBay store: https://ebay.us/6BPOyd

*Links to items below may be affiliate links for which I may be compensated*

Inspur DGX V100 Server from UnixSurplus: https://ebay.us/1U6I0H
Dual Intel Xeon 8260 24-Core / 48-Thread
256GB DDR4-3200 REG-ECC
2x MZ-7LH1T90 1.92TB SSD Drives
8x Nvidia V100 32GB SXM2 GPU
1x 4x 10GB SFP+ Mezzanine Card
OB NIC: 100GB Nvidia CX6 network card

Follow me on Bluesky @CraftComputing.bsky.social

Support me on Patreon and get access to my exclusive Discord server. Chat with myself and the other hosts on Talking Heads all week long. https://www.patreon.com/CraftComputing

Timestamps
0:00 - Intro
2:11 - Sponsor - Verda.com
3:29 - Speeds and Feeds
6:10 - Llama 3.1 8B (MLPerf)
8:16 - Phi 4 Reasoning 14B (MLPerf)
9:06 - LM Studio (Llama 3.3 70B + GPT-OSS 120B)
14:56 - As Always, I'm Mikey
20:15 - Wrapping Up
22:57 - Deschutes Fresh Squeezed NA
