Analysis Summary
Performed authenticity
The deliberate construction of "realness" — confessional tone, casual filming, strategic vulnerability — designed to lower your guard. When someone appears unpolished and honest, you evaluate their claims less critically. The spontaneity is rehearsed.
Goffman's dramaturgy (1959); Audrezet et al. (2020) on performed authenticity
Worth Noting
Positive elements
- This video provides a rare, detailed comparison of specific LLMs (like Claude 3.5/4.6 and MiniMax) across cost, speed, and accuracy for actual software engineering tasks.
Be Aware
Cautionary elements
- The 'neutral' benchmark is produced by a commercial entity whose business model relies on the adoption of the very 'agentic' workflows the benchmark validates.
Influence Dimensions
About this analysis
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Transcript
If you're deploying AI agents for software engineering, you're facing a critical decision: which LLM actually delivers when it comes to real-world coding tasks? It's not about benchmarks that test basic specs. It's about actual issue resolution, front-end development, and production workflows. And the answer is not that simple. Some models are fast but expensive. Others are cheap but inconsistent. So how do you cut through the noise? That's exactly what the OpenHands Index solves. It's an up-to-date leaderboard that evaluates large language models across real software engineering tasks. So which models are winning? What should your teams use? What should they avoid? Those are the questions on everyone's mind, and that's what we're diving into today with Graham Neubig, Chief Scientist at OpenHands, on this episode of Agentic Enterprise. Graham, it's great to have you on the show.
>> Yeah, great to be here.
>> Today we're going to talk about the OpenHands Index, but before that, I'd love to hear a bit about OpenHands, the project and the company. Give us the quick story.
>> Sure. OpenHands is an open-source agent for software development. If you're familiar with Claude Code, it's like Claude Code but open source and locally deployable, and you can use it with any model. It started as a community project about two years ago, and then some of the community members formed a company around it, where we help enterprises deploy it.
>> How old is the company, and why was it created?
>> The company is about a year and a half old. We created it because we were already building the project, it had a lot of momentum, and people were excited about it. Our CEO, Robert, wanted to quit his job and work on it full-time, so we formed a company to make that possible.
>> Now let's talk about the OpenHands Index. What does it do?
How do you actually benchmark these LLMs?
>> One of the features of OpenHands is that it's model-agnostic, so we can use any model. Every time a new model comes out, our users would ask us: is this a good model? Should we try it? Do you support it? The OpenHands Index is a benchmarking effort we created to be able to answer that question quickly. There are lots of benchmarks for software development agents. The most famous is SWE-bench, which some people may have heard of. Basically, it checks whether agents can solve issues on open-source Python repositories, and that's great, it's a useful thing to do, but there's a lot more to software development. So the OpenHands Index is a much broader benchmarking effort covering five different benchmarks. SWE-bench is one of them, but there are also benchmarks on front-end development, software testing, information gathering, and other tasks.
>> When you run these benchmarks, which models come out on top? What did you find?
>> It's a great question, and we actually benchmark from three different perspectives, all of which I think are important. The first is accuracy across those five benchmarks, the second is cost, and the third is time to resolution. The answer is very nuanced, but to give a very short one: right now the top model is Claude Opus 4.6 from Anthropic, which I think a lot of people are using through Claude Code. It's top both in accuracy and in speed to resolution; it finishes issues very quickly. The only problem is that it's also one of the most expensive models.
And so there are lots of other options that are pretty good but cheaper. Gemini Flash is one of them, and then there are open models like MiniMax that are very inexpensive. They're less capable, but you can use them as much as you want and barely put a dent in your finances.
>> When it comes to coding specifically, which model is your favorite, and what makes it stand out?
>> It's a great question. One other feature of the index is that we have different categories. For purely coding-related tasks, Claude Opus is very strong. But if you look at, for example, building an app entirely from scratch (creating a new thing rather than gradually improving an existing one), Codex can actually be better, because it tends to follow instructions very carefully and keep working until it's really, truly done, whereas the Claude models can still stop halfway through or not fully finish all of your instructions. Another thing we focus on a lot at OpenHands is reusable workflows: for example, having an agent review your PRs, or upgrade libraries when they go out of date. You're going to do those things over and over again, and you really don't need a big, powerful model like Opus for them. You can go with the less expensive open-source models instead.
So one of the biggest things we want to get across with the OpenHands Index, which you can see at index.openhands.dev, is that there's a lot of nuance to the answer. If you want to do one thing, you might want one model; if you want to do another thing, you might want a different model. That's all supposed to be on there.
>> Now let's talk about one of the biggest pain points: cost. Which LLM gives users the best bang for their buck on software engineering tasks?
>> Right now, for cost optimization, our favorite is MiniMax, an open-weights model that was released pretty recently. It's the first model we've seen that is comparable to Claude Sonnet but at about one-tenth of the price, so it's a very good deal. There are other recently released models we're in the process of benchmarking, like GLM-5, which I think will be similarly well priced and also very capable. The biggest thing we focus on in the OpenHands Index is a figure showing the Pareto curve of cost versus accuracy: if you want a model at a given cost, use this one; if you want the most capable model, use that one. You get a whole spectrum.
>> What about the context window, since that can affect how much you can do? Which model finds the sweet spot between pricing, capability, and context window?
>> That's a great question too. The context window of most recent coding-focused models is actually pretty large. I think models that are capable at coding are specifically trained to have a large context window for precisely this reason: if you don't have a large context window, the agent kind of falls over.
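The cost-versus-accuracy Pareto curve Neubig describes can be sketched in a few lines. The model names, prices, and accuracy numbers below are invented for illustration, not real index results; the point is only the selection rule, under which a model stays on the frontier if no other model is at most as expensive and strictly more accurate.

```python
# Sketch of a cost-vs-accuracy Pareto frontier, as described above.
# Model names and numbers are illustrative, not real index results.

def pareto_frontier(models):
    """Return models not dominated by any other model.

    A model is dominated if some other model costs no more
    and is strictly more accurate (cost in $, accuracy in %).
    """
    frontier = []
    for name, cost, acc in models:
        dominated = any(
            c <= cost and a > acc
            for n, c, a in models if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("big-proprietary", 4.00, 72.0),  # accurate but expensive
    ("mid-tier",        1.50, 64.0),
    ("open-weights",    0.40, 60.0),  # cheap, slightly less capable
    ("overpriced",      2.00, 58.0),  # dominated: costs more, scores less
]

print(pareto_frontier(models))
# 'overpriced' drops out; the rest trace the cost/accuracy spectrum
```

Plotting the surviving models gives exactly the "spectrum" figure described: pick your budget, read off the best model at that cost.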
On our index we mostly focus on models we think are going to be good, because our time is limited and we don't want to spend it benchmarking models we expect not to work, so most of the models on there already meet that criterion. But if you want to deploy a model locally, that can be a pain: you need a big machine that can handle the full context window without running out of memory. As long as you're accessing it through an API, you'll usually be okay.
>> Can you talk about how organizations actually work in real life? Do they pick one LLM and stick with it, or do they mix and match based on task, cost, and context window?
>> Good question. Right now a lot of organizations focus on a single language model, and the reason is that they don't have a good view of which model to use for which task. It actually requires quite a bit of expertise to manage that, and most of the organizations we talk to, especially larger enterprises, aren't quite familiar enough with the landscape yet to do it. But I see that coming: once deployments are out and really working, the next thing to look at is how not to spend millions of dollars on API costs. And honestly, the cheaper models are getting so good that for a lot of simpler tasks you don't need an expensive model at all. I see much more of that coming, and it's also what we're trying to enable.
>> As we've seen in the modern tech stack, developers choose the tools they want.
It's not forced on them from the corporate board level. But is it the same with LLMs? Do teams get to choose which models they use, or does the organization decide for them and say, "Hey, this is the model you'll be using"?
>> Great question. Some of our larger customers at OpenHands are big, security-sensitive enterprises. In those cases they usually have a trusted language model, possibly one deployed on their own infrastructure, so they don't shop around for language models very much. On the other hand, we have a lot of people, for example in our open-source community, who see a new language model come out and say, "I've got to try it," throw it into OpenRouter, test it immediately, and then ask, "Is this actually the one I should be using, or is another one better?" So I see the whole spectrum.
>> How often do you update the index as new models keep coming out? How do you manage that?
>> When a new model comes out that people are interested in, we try to support it. That's actually not the easiest thing to do, because running agentic evaluations is expensive. We have pretty good infrastructure, but even so, each new model is a unique snowflake: something is always different.
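The per-task mixing and matching discussed above (a strong model for hard issue resolution, a cheap one for routine workflows like PR review) can be sketched as a simple router. Every task name, model name, score, and price below is invented for illustration; a real router would read these tables from benchmark results like the index.

```python
# Sketch of per-task model routing: for each task category, pick the
# cheapest model whose benchmark accuracy clears a floor for that task.
# All task names, model names, scores, and prices are invented.

SCORES = {  # task -> {model: accuracy %}
    "issue_resolution": {"opus": 73, "sonnet": 66, "open-small": 52},
    "pr_review":        {"opus": 70, "sonnet": 68, "open-small": 63},
}
PRICE = {"opus": 4.00, "sonnet": 1.20, "open-small": 0.30}  # $ per task

def route(task, accuracy_floor):
    """Cheapest model meeting the floor; fall back to the most accurate."""
    scores = SCORES[task]
    ok = [m for m, a in scores.items() if a >= accuracy_floor]
    if ok:
        return min(ok, key=lambda m: PRICE[m])
    return max(scores, key=scores.get)  # nothing qualifies: take the best

print(route("issue_resolution", 65))  # harder task, needs a stronger model
print(route("pr_review", 60))         # routine workflow, cheap model suffices
```

This is the shift Neubig anticipates: instead of one trusted model for everything, a per-task table of accuracy floors and prices decides which model handles which workflow.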
I don't think the industry is nearly as standardized as people might imagine on what you need to do to get a model actually working. A new model will add new thinking blocks, or other things like that. If everything goes well, we can support a new model in a day or a couple of days, which is just the time it takes to run the index overall. But if something breaks and we need a new implementation, or we need to debug why we're not getting the results we expected, it can take longer.
>> You're doing all this work. How do people get access to the information you're creating? What's the mechanism for them to get involved and leverage it, and who is using it? Let's talk about how organizations can take advantage of the work you're doing.
>> I think there are many levels at which you can use it. There's index.openhands.dev, where we have a very high-level summary: you can see which model has the best cost and which has the best accuracy. You can then click into each individual use case. If you care about issue resolution, click into the issue resolution use case and see what's there; likewise for front-end development and the others. All of our benchmarking code is open source (the infrastructure code is not, but the benchmarking code is), so you can see exactly what happened, and all of the results, including all of the conversations from the agents, are on the site. You can go and see, for each individual example, what the agent did.
So depending on the level of detail you want, you can go anywhere from a very high-level summary down to exactly what the agent did in a particular case. For organizations, one of the biggest use cases is that when we talk to an organization that would like to work with coding agents like OpenHands, we show them the trade-offs of selecting a particular model: this model might be good for this use case, that model for that one, this is the expected cost, and so on. I think that's really important as we move toward a more model-agnostic, pick-your-own-model world.
>> Based on the index and your experience in this space, what should engineering teams look at when deciding which LLM to use for their agents, depending of course on industry and use case?
>> I'd say my initial advice is that the OpenHands Index is just meant to be a first pass.
It doesn't cover the actual use cases you're incorporating in your enterprise. You can take a first look at these benchmarking results, but in the end the question is: is this working in deployment? Is it meeting our cost and other requirements? Beyond the index, one thing we've been working on is observability. You take the output of an agent you've actually put into production, gather all of the conversations it has had, analyze each conversation with a language model, and then aggregate all of that information. We're partnering with an observability platform called Laminar to do that sort of thing. So the OpenHands Index is your first pass; then, after you've deployed, you get visibility into what all of your users are doing, analyze it with a language model (because we can do that now), and gradually improve your prompts and things like skills, which are instructions on how to use an agent in a particular use case. It doesn't start and stop with an index. It ends with deployment, monitoring, and everything else.
>> When you evaluate these models, do you also track usage patterns? A lot of folks feel AI is magic pixie dust: the models are great, but the way they're used and deployed is where the problem happens. People say it's not working, but it's not because the car is broken; it's because they don't know how to drive. Do you track that at all?
>> Yeah, exactly. So I talked a little bit about skills, which are essentially fixed prompts that you use for a particular use case.
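The observability loop just described is: gather production conversations, analyze each one, then aggregate to spot patterns. A minimal sketch follows; in practice the per-conversation analyzer would be a language-model call (for example through an observability platform), but here it is stubbed with keyword rules, and the example transcripts and tag names are invented.

```python
# Sketch of the observability loop described above: gather agent
# conversations, analyze each one, then aggregate the results.
# The analyzer is a keyword stub standing in for an LLM judge.
from collections import Counter

def analyze(conversation: str) -> str:
    """Stub for an LLM judge: tag a conversation with an outcome."""
    text = conversation.lower()
    if "error" in text or "traceback" in text:
        return "agent_error"
    if "wrong file" in text or "missing context" in text:
        return "context_failure"
    return "success"

def aggregate(conversations):
    """Roll per-conversation tags up into counts to spot patterns."""
    return Counter(analyze(c) for c in conversations)

logs = [  # invented example transcripts
    "User: fix the bug. Agent: done, tests pass.",
    "Agent: Traceback (most recent call last): ...",
    "Agent: I edited the wrong file because of missing context.",
    "User: upgrade the library. Agent: done, tests pass.",
]
print(aggregate(logs))
```

The aggregated counts are what drive the improvement loop: a spike in one failure category tells you which prompts or skills to fix next.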
And you can do that right, and it really helps. But there's also some recent research demonstrating that the skills people are creating on real repositories are often bad and actually hurt accuracy, especially if you generate them automatically with agents without any monitoring. So that's a huge thing. Another huge thing is context engineering: how do you get the necessary context into the agents? No matter how good the brain is, if you don't give it the information it needs, it's not going to do a good job.
>> What's next on your radar? Everybody is using these LLMs. What's the next challenge organizations face when they deploy these models, and what are you looking at to help them evaluate it?
>> I'd say by far the number one challenge right now is verification. Right now, code is cheap. It never was before, but now you can generate as much code as you want. Good code is not cheap, though, because you need to go back and check that it actually works, meets the requirements, and doesn't add a lot of tech debt. Within our pipeline we have a few things addressing that. We've trained models to tell whether code is going to survive into the future, so before a human even looks at it, we can guess whether the code is actually good and will live on. You can also automatically prepare your repository so that you can verify generated code, starting with unit tests and static analysis, the standard tools people always use, all the way up to AI code review and AI prediction of code survival. That's what we have coming.
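The layered verification described above (cheap deterministic checks first, expensive AI review last) amounts to a short-circuiting pipeline. The sketch below uses stand-in check functions, not real linters, test runners, or the survival models OpenHands has trained; only the ordering idea is from the interview.

```python
# Sketch of layered verification: run cheap deterministic checks
# (static analysis, unit tests) before expensive AI review, and stop
# at the first failure. All checks here are illustrative stand-ins.

def run_pipeline(code: str, checks):
    """Run (name, check) pairs in order; stop at the first failure."""
    for name, check in checks:
        if not check(code):
            return f"rejected by {name}"
    return "accepted"

# Stand-in checks, ordered cheapest first.
static_analysis = lambda code: "eval(" not in code  # lint-style rule
unit_tests      = lambda code: "return" in code     # pretend test run
ai_review       = lambda code: len(code) < 500      # pretend LLM judge

checks = [("static analysis", static_analysis),
          ("unit tests", unit_tests),
          ("AI review", ai_review)]

print(run_pipeline("def add(a, b):\n    return a + b", checks))
print(run_pipeline("eval(input())", checks))
```

Because generated code is cheap and review is not, the ordering matters: most bad candidates should die at the lint or test stage before any expensive model ever sees them.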
It's somewhat separate from the OpenHands Index, but we're going to merge them together soon.
>> Graham, thank you for joining me and breaking down the OpenHands Index. It's clear that choosing the right LLM for software engineering isn't just about raw performance; it's about balancing cost, task-specific strengths, and real-world deployment results. Thank you for sharing those insights, and I look forward to our next conversation.
>> Thanks so much. This was a great experience.
>> And for those watching: if you're deploying AI agents or trying to figure out which LLM makes sense for your teams, check out OpenHands and the OpenHands Index at openhands.dev. Don't forget to subscribe to TFiR, like this video, and share it with your teams. Thanks for watching.
Video description
If you're deploying AI agents for software engineering, you're facing a critical decision: which LLM actually delivers when it comes to real-world coding tasks? It's not about benchmarks that test basic specs, it's about actual issue resolution, front-end development, and production workflows. The OpenHands Index solves this by providing an up-to-date leaderboard that evaluates large language models across real software engineering tasks. Graham Neubig, Chief Scientist at Open Hands, breaks down which models are winning, what engineering teams should consider when choosing an LLM, and how organizations can balance accuracy, cost, and context windows. From Claude Opus 4.6 to open-source alternatives like MiniMax, discover which models deliver the best bang for your buck. Read the full story at www.tfir.io #LLM #AICoding #SoftwareEngineering #OpenHands #ClaudeAI #CodingAgents #MLOps #DevOps #EnterpriseAI #AIAgents