Building Nubank
Analysis Summary
Ask yourself: “Did I notice what this video wanted from me, and did I decide freely to say yes?”
Worth Noting
Positive elements
- This video provides a rare, detailed look at the actual architecture and evaluation pipelines used to manage LLMs at a scale of 100 million+ users.
Be Aware
Cautionary elements
- The speaker uses his significant academic 'halo' (citing specific papers and algorithms) to discourage skepticism of the company's technical stack.
Influence Dimensions
Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.
This analysis is a tool for your own thinking — what you do with it is up to you.
Transcript
[music] Thanks, everyone. Very excited to talk about how we build AI agents for customer support at Nubank: how we deliver value to our customers without degrading their experience, while still leveraging the best of AI. Before I start: we have a booth here, and we're handing out swag at the end of this talk and at the booth, so feel free to stop by. We're solving an amazing array of challenges, AI and otherwise, and we'd love to chat with you; I'll be at the booth as well. I'll leave my email address here in case you need to get in touch.

The talk is divided into several parts, so we'll cover a lot of ground, but let me start with what we actually do. We are Nubank: we are redefining banking and financial services in Latin America and the world. Today we have 127 million customers across three countries (Brazil, Mexico, and Colombia), and we deliver an app-first experience that our customers love. Our customers love the way we empower them financially and help them in their financial lives, and technology is the heartbeat of this business. This is the story of Nubank: founded in 2013, became a unicorn in 2018, IPO'd on the NYSE in 2021, and today the largest digital bank outside Asia. We are the only digital bank outside Asia to reach 100 million-plus customers, and we offer an array of solutions to improve the financial lives of our customers, including credit cards, accounts, investments, lending, insurance, NuCel, and more.

About me: I'm a principal engineer on the AI core team at Nubank. I previously spent about 11 years at companies like LinkedIn, Apple, and Amazon. At Nubank I focus on foundation models and LLMs for finance: predictive, generative, you name it.
I have a track record of shipping a lot of deep-tech AI and ML work in products: multiple billions of dollars of business impact in revenue and double-digit engagement improvements for large companies. On the research side, I've published at pretty much all the top AI conferences: NeurIPS, ICML, ICLR, KDD. I like to keep one foot in the research world and see which ideas can impact product. More importantly, some of my recent work is on LLMs. I was the tech lead for LLM efficiency at LinkedIn, where I worked on on-policy distillation and compression, including quantization and pruning of models; a lot of this was deployed in production, and we had a paper on it at EMNLP this year in China. We also published a breakthrough algorithm at ICML this year on LLM preference alignment; think of how you go from GPT to ChatGPT, the algorithms used there. I built a new algorithm in that space. I'm also the creator of QuantEase, the first at-scale quantization algorithm that uses coordinate descent, which makes it much more stable than algorithms like GPTQ. And very recently I gave a tutorial on efficient SLMs at the ACM Web Conference in Sydney, Australia.

So, as you can see, I'm pretty active in the research community. I like to build things around LLMs, but I also like impact; I don't like to do research in isolation. If any of this resonates with you, I'm happy to chat offline. That's my email address; feel free to reach out, or I'll be at the booth all day after the talk.

Okay, that was about Nubank and me, so I hope you understand what we do and how I'm trying to help the business. Let's talk about the problem at hand. As you know, we are an app-first company.
Customer support, helping our customers, is really important to us, and our customer-support agents, the Xpeers, are amazing at taking care of our customers' financial needs. They do an amazing job helping you get a credit card, get information about loans, renegotiate debt, a bunch of different things, and AI plays a central role in this entire ecosystem. We have AI agents that supplement or augment our Xpeers; we also have copilots and systems that improve their efficiency. Internally, we have AI products our teams use for search and as assistants. And we partnered with OpenAI on a product called ChatGPT Go, a low-cost tier offered exclusively to Nubank customers in Brazil, which lets them leverage the power of OpenAI's models at a fraction of the cost of the Plus and Pro subscriptions.

So what is the core problem we really want to solve? We want to delight our customers, and we want to do it in a way that is instant, empathetic, and useful, but at the same time safe and actually correct. That's the core tension: if we use AI, how do we do it effectively, so that the models we use don't hallucinate and don't make mistakes? Customer trust is very important, because once you lose it, it's very hard to win back.

Here are some example use cases where different companies, and Nubank too, build AI agents. Card delivery and logistics: if you order a new credit card, where is it? Is it lost? Did it get delivered to your neighbor? If it's lost, what do we do about it? Maybe your dog ate it and you want a new one; how do we get a replacement to you quickly? Debt renegotiation: let's say you have a credit history that's not so great.
How can we help you renegotiate your debt and get back to a good credit history, so that you can be more involved in the financial ecosystem? Or you want a new credit limit and you want AI to short-circuit the process for you. Then there are fraud- and security-related use cases, and the list goes on; these are just a few I chose for the talk. So, as you can see, everything our customers do involves our Xpeers and our AI agents in a big way, and since we are app-first, voice and text are very big channels for customers reaching out to us. Our Xpeers do an amazing job; our NPS is super high, about 84 if I remember correctly, which is really high, so clearly customers love our Xpeers. But how do we scale our Xpeers, augment them, and empower them through AI agents, while still maintaining a very high NPS and very high customer love? That's where we need AI agents to step in, and we need effective AI agents.

I'll be honest with you: there's a lot of fluff about agents, and I personally hate it. I've been to a bunch of conferences where people talk about amazing multi-agent systems that write a million lines of code a day. I don't think I can review a million lines of code a day. So my talk is focused on what we do to build something that actually works. It may not be impressive in the sense that it's not multi-agent and it's not writing a billion lines of code a day, but as I'll show you from the results, our customers love it. That's the focus of my talk. I'll define everything on this slide in a second, including what an agent is, because even that question is hard to answer. At a high level, an agent is an autonomous system that can solve a task end to end, taking a series of steps to do it. The questions are: when should we use agents? What is an agent?
What makes one effective? And how do we maintain very strong customer love? Let's dive into each of these slowly; I have a lot of slides on this.

First, let's talk about what an agent even is, because a bunch of people I talk to say, "Oh, I'm going to build an agent," and when I ask, "Do you know what an agent is?" I get blank stares. So let's define it. With LLMs, if you're not training your own, the only leverage you have is the prompt. If you use ChatGPT every day, and I'm sure a bunch of you do, you write text and it responds; that's your leverage, and it's captured in a prompt. If you were to use GPT today as the brain of your agent, the only thing you can do is type text and hope it does what you ask. But that's really just a chatbot. ChatGPT circa 2023 could not take real actions for you; it was a glorified chatbot. (It can now.)

Today, though, you can attach these things called tools. "Tool" is also a very overloaded word, but a tool is nothing more than a component that helps the LLM interact with something outside itself. Say I ask an LLM what the weather in Sydney, Australia is like: it's going to call an API and fetch the weather for Sydney. That's a tool. So when I say tool, it's most likely an API, or a calculator, or something that helps the LLM get the job done. Usually you pair a prompt with a bunch of tools. For customer support, I may want the information for the customer I'm talking to, so I have an API called get_customer_info. I want to check the delivery status of this person's credit card, so I have another tool. And I want to search a knowledge base of FAQs; that's a search tool.
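The prompt-plus-tools pairing just described can be sketched concretely. The tool names, schemas, and stub implementations below are illustrative stand-ins for the ones mentioned in the talk, not Nubank's actual API:

```python
# Minimal sketch: an agent is a system prompt plus a set of tools.
# All names and schemas here are hypothetical.
SYSTEM_PROMPT = (
    "You are a customer-support agent. Use the available tools to answer "
    "questions about cards, deliveries, and account FAQs."
)

# Tool schemas in the JSON style most LLM APIs accept.
TOOLS = [
    {"name": "get_customer_info",
     "description": "Fetch the profile of the customer in this session.",
     "parameters": {"customer_id": "string"}},
    {"name": "check_delivery_status",
     "description": "Check where a newly issued credit card currently is.",
     "parameters": {"card_id": "string"}},
    {"name": "search_knowledge_base",
     "description": "Search FAQ and policy articles.",
     "parameters": {"query": "string"}},
]

# Stub implementations so the dispatch can be exercised locally.
def dispatch(tool_name: str, **kwargs) -> dict:
    impls = {
        "get_customer_info": lambda customer_id: {"id": customer_id, "tier": "standard"},
        "check_delivery_status": lambda card_id: {"card_id": card_id, "status": "in_transit"},
        "search_knowledge_base": lambda query: {"hits": [f"FAQ entry about {query!r}"]},
    }
    return impls[tool_name](**kwargs)
```

In a real system the schemas would be passed to the model provider's tool-calling API and the stubs would call internal services.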
So, as you can see, your agent is usually nothing but a bunch of text and a bunch of tools, and honestly, that's really it. I could dive into fine-tuning and a bunch of stuff that, as a nerd, I love, but if you're building your first agent I would not advise it. Just figure out the hero use case for your intent and go after that. That's one big lesson I've learned personally: you need to build agents that actually do the job. You can't just build chatbots and hope your customers will love them.

As I said, the word "agent" is confusing; every company calls something different an agent. Anthropic calls some things workflows and other things agents, while OpenAI calls everything an agent. Who knows what the right definition is; it's not settled. But this is my definition, shamelessly copied from Anthropic's guide to building effective agents. Let's start with a workflow. A lot of you are software engineers, so you understand DAGs, directed acyclic graphs of the things you want to do: start with unstructured data, parse it, train a model on it, ship the model. If you're following a predefined set of paths, that's a workflow, at least in my book. Then there is a category of agents called ReAct agents, short for reasoning and acting. These agents think, they reason, and they act, where acting can mean calling a bunch of tools; then they observe the output, and they can keep doing this in a loop until they're stopped. Conceptually they're a lot simpler to understand at a high level, but they're really powerful. How many people here have used Cursor? Quite a few hands went up. Most Cursor agents are ReAct-style agents: they take your question about code as input.
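The reason-act-observe loop just described can be sketched in a few lines. The scripted `fake_llm` below is a stand-in for a real model call, and the message format is an assumption for illustration:

```python
# Toy ReAct loop: reason -> act (call a tool) -> observe, repeated until
# the model emits a final answer. Everything here is illustrative.
def react_loop(llm, tools, question, max_steps=5):
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(transcript))                  # reason over history
        if step["type"] == "final":
            return step["answer"]
        observation = tools[step["tool"]](**step["args"])  # act
        transcript.append(f"Action: {step['tool']}({step['args']})")
        transcript.append(f"Observation: {observation}")   # observe
    return "Escalating to a human agent."                  # safety valve

# Scripted stand-in model: one tool call, then a final answer.
def fake_llm(history):
    if "Observation" not in history:
        return {"type": "tool", "tool": "check_delivery_status",
                "args": {"card_id": "c-42"}}
    return {"type": "final", "answer": "Your card is in transit."}

TOOLS = {"check_delivery_status": lambda card_id: {"status": "in_transit"}}
answer = react_loop(fake_llm, TOOLS, "Where is my new credit card?")
```

The `max_steps` cap plus the human fallback mirrors the rerouting idea discussed later in the talk: the loop never runs unbounded.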
They call a bunch of tools and keep iterating on their own output until they decide to give control back to you. Most powerful agents today are ReAct agents. Of course, other paradigms can work too, but I'd say the majority of powerful agents are conceptually very simple to define: your brain is the LLM, you have a bunch of tools which are actions the LLM can take, and that's pretty much it. That's how you build an agent.

Now let's talk about what makes an agent effective, because this is where I feel most people get it wrong, and it requires a really high level of rigor to make agents actually work. Coding is a very good product-market fit for agents; that's why you see companies like Cursor releasing their own models, like Composer, and OpenAI, Anthropic, and Google releasing amazing coding models. But you don't hear of many agents outside the coding domain that actually work. If you remember, OpenAI released an agent to book flights for you; I tried it and it didn't work for me. I've tried to make it do other things as well, which kind of worked; it does an okay job researching products when I want to shop on a limited budget. But coding is at a different level at this point. Coding is an amazing product-market fit, and code is a lot easier to verify: you can compile it, you can check the output, so the signal you have is great. How do you know whether the signal you have in customer support is good? That's a hard problem, because the customer may or may not give you any feedback; I'll dive into this in a second. At a high level, you want your agents to be reliable: given the same input, they should give you the same output almost all of the time. They have to be correct.
The answers have to be grounded in truth. Maybe you have a database about customers; you don't want the LLM to hallucinate customer information, and that happens a lot, by the way. You want the agent to be delightful; this is where I feel Nubank is a cut above, because we really care about our customers having a great time using our app. You want it to be compliant; as an AI researcher, that's my least favorite but a very important topic. Compliance matters: you want to adhere to strict regulations and bake that into your agent. And finally, it has to be measurable. Are you doing well? By how much? All of these things have to be measured, and the measurable part is usually where I start: I want to measure things before I improve them, and I'll talk about an approach where we start with measurement first. In my mind, these are the five pillars of an effective AI agent.

To conclude this part of the talk: building trustworthy AI agents matters because it makes our customers' lives easier. The idea for us is to resolve the easy questions end to end, to give our Xpeers, who do an amazing job, better context, and to continuously learn and improve; all the feedback you get from customers is important to leverage somehow. Most people are never going to fine-tune their own models, because you probably don't need to. I'll grant you that if you fine-tune using reinforcement learning, which is all the rage in Silicon Valley right now, you will get some benefits for sure. But you have to ask yourself: do I care about getting 90% of the way there? Is that enough, or do I want the extra 5%? If the answer is that I don't need the extra 5%, let me get the 90% right.
I would just focus on taking the best API model available today, maybe OpenAI, Anthropic, or something open source, and giving it the best possible tools. I'll dive into all of that in the rest of the talk.

Okay. We've introduced what an agent is and the criteria an agent should satisfy to be effective. Let's talk about the evaluation problem, something very close to my heart. Most CEOs who run core customer support care about metrics like these. For people who don't know, NPS is Net Promoter Score; think of it as an aggregate thumbs-up versus thumbs-down measure of how many people say the experience was good. Sometimes when you talk to customer support you get a survey at the end; this is that survey. A lot of CEOs care about it because it gives a snapshot of whether your customers love what you're offering. Then there's something like self-service resolution rate: how many cases were solved by AI end to end versus sent to humans, and how much time it takes to resolve open tickets. If you get a very high volume of tickets, like we do (we get an insane volume), how do you resolve them effectively so that you don't overload the Xpeers and customers get their queries solved quickly and simply? But here's the problem, and it's funny how few people talk about it: I don't answer surveys, and a lot of people don't. I'm not going to tell you the exact percentage of people who answer ours, but it's not very large, and it isn't for any company. How do you get out of this situation where your response rates are low? And sometimes the feedback is delayed.
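As an aside on the metric itself: NPS is conventionally computed from 0-10 survey scores, with 9-10 counted as promoters and 0-6 as detractors (the talk's thumbs-up/thumbs-down framing is a simplification). A minimal sketch:

```python
# Net Promoter Score as usually defined: NPS = %promoters - %detractors,
# where scores of 9-10 are promoters and 0-6 are detractors.
def nps(scores):
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# 8 promoters, 1 passive (7-8), 1 detractor out of 10 responses -> NPS 70
score = nps([10, 10, 9, 9, 9, 10, 9, 10, 8, 3])
```

The low-response-rate problem is visible right in the denominator: `scores` only contains the customers who bothered to answer.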
Sometimes the customer answers surveys hours or days later, and you're usually working with small sample sizes. Those of you who have done a lot of A/B testing know this nightmare: you don't have enough power in your A/B test to tell whether variant A is better than variant B. This is a huge problem, and it's a problem for us as well. How do you deal with it? Statistically, you can't cheat and somehow get small sample sizes to give you significant results, but you can do something else. (The clicker stopped working; I'll use the laptop.) I'm a big fan of a quote, from Sergey Brin or Larry Page, I'm honestly not sure which: let a thousand flowers bloom. All my career I've done a lot of A/B testing; if I have an idea, I A/B test it. Intuition is good, but if the data doesn't back up your intuition, your idea is terrible. So how do you run a very large number of A/B tests and trust your results? That's the really hard problem, because when I'm building agents I want to test eight, nine, ten different variants at the same time, but it's impossible to assign a large amount of traffic to each variant. And at the same time you cannot lose customer trust: you don't want to be casual and test something in production that destroys the experience for even one customer. So how do you A/B test safely, and do a lot of tests? In any use case, unless we run at least ten A/B tests, I'm personally never happy. So what do you do? This isn't something I came up with; a lot of people in Silicon Valley have been doing it for a couple of years: you build evaluations. And that's another word
that has become kind of a dirty word, like "agents". Everybody talks about evals, but what do you even mean when you say eval? I'll define that in a second. For us, this is how we operate: we never automatically ship anything to 100% of traffic. We test everything on a small slice, and if things go badly, we automatically reroute to our human Xpeers, who are amazing and still do a much better job than AI today. We want evaluations that test every single conversation, not only the conversations where users answered a survey. Suddenly, instead of a very small response rate, you have evaluations for every single interaction that happened, so it's much easier to reason about whether variant A is better than variant B.

Here's an example of an eval. Say you have conversations that happened with customers, a thousand of them. You can create an LLM-as-a-judge evaluation to test a specific aspect of the conversation, and the word "specific" is important. Say I care about correctness against the knowledge base. Then I write a prompt: "You are a fact-checking auditor; verify the agent's claims against the knowledge base." I supply the entire conversation to this evaluator and use another LLM, potentially different from my agent's LLM, to run through the conversations that happened in the previous day or the previous few hours. Now the question is: your eval has to be trustworthy. If the agent isn't trustworthy and the eval is also an LLM-as-a-judge, how do you know the eval is trustworthy? That's a fair question, and we'll resolve it in a second. But say you've come up with this amazing eval that checks one specific aspect, correctness. You may have another that checks for conciseness.
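The eval pattern above (one narrowly scoped check per evaluation) can be sketched as follows. The judge prompt wording and the plain-string conversation format are illustrative, and `judge_llm` stands in for a call to a model different from the agent's:

```python
# One LLM-as-a-judge eval plus one deterministic eval, each testing a
# single specific aspect of a logged conversation. All names are
# illustrative, not a real eval framework's API.
JUDGE_PROMPT = (
    "You are a fact-checking auditor. Verify every claim the agent makes "
    "against the knowledge base below. Answer PASS or FAIL.\n\n"
    "Knowledge base:\n{kb}\n\nConversation:\n{conv}"
)

def correctness_eval(judge_llm, conversation, knowledge_base):
    """LLM-as-a-judge: correctness against the knowledge base."""
    verdict = judge_llm(JUDGE_PROMPT.format(kb=knowledge_base, conv=conversation))
    return verdict.strip().upper() == "PASS"

def conciseness_eval(conversation, max_words=120):
    """Deterministic: flag wall-of-text replies, no judge model needed."""
    return all(len(line.split()) <= max_words
               for line in conversation.splitlines())

def pass_rate(results):
    """Aggregate over every logged conversation, not just surveyed ones."""
    return sum(results) / len(results)
```

Because these run on every conversation, `pass_rate` gives you a per-variant score with far more samples than any survey would.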
Is your agent spitting out a wall of text that's insane for your customers to read? Is the agent actually sticking to its intent? Say I want the agent to hand off to another agent if the question is about a different topic: is it actually doing that? So figure out the five to ten important jobs to be done, the things the agent must stick to, and write them down. What I do personally is actually read the chats, and unfortunately that part cannot be skipped. It's painful and really boring, though sometimes it's interesting, because some interactions are really fun. You read the conversations and figure out what your customers even want. How are they interacting with the human Xpeers, or with the AI agent if one is already in place? Are they angry or upset? How many turns of conversation are happening? How long are the LLM's responses? What is the correlation between thumbs-up/thumbs-down and the actual conversation? Spend a week doing that for every agent. This is the hardest part. Once you've done it, you can start creating these LLM-as-a-judge evaluations. Now, some evaluations are not LLM-as-a-judge; maybe you just want to check how many turns of conversation happened, and then it's a deterministic evaluation. But most evaluations are subjective questions.

>> You have the Xpeers, and you have the logs of all their conversations with customers. Are you using those to train models, or not at all?

>> Great question. This system is more about evaluation. If I were, say, fine-tuning a model, I would definitely use that data.
If you're not fine-tuning the model, I would still use that information to create evaluations for the agent if they don't exist yet, or to modify the agent's prompt by giving it samples: this is an example of a conversation that went well; this is one that went south.

>> Yes, exactly. Few-shot prompting, right? Yeah. Cool.

So with evals you can iterate and test fast. I haven't yet told you how to write an effective eval, but don't worry, I will tell you exactly how. This is really the secret for most companies: if you cannot measure it, you cannot improve it. So focus on measurement; be fanatical about it. We are very fanatical about it: we want to measure every single aspect of everything we do, because once we measure it, it's easy to see where the first 80% of the gains come from, and usually it takes 20% of the effort to get that first 80% of the gains. So I always say: if nothing exists yet, just measure.

Okay. We've covered the intro and what an agent is. How do you measure things? You create evaluations, which can be deterministic or LLM-as-a-judge. Now let's talk about leverage over AI models. This is also quite an interesting topic.
If you had asked me a year ago whether everybody should fine-tune their models, I, having fine-tuned models all my career, would have said of course. But in the last year, since the rise of reasoning models like o3 and now GPT-5.1, along with Anthropic's and Gemini's models, these frontier models have become really, really good, so my belief that everybody should fine-tune has dropped considerably. I don't think everybody should fine-tune. You can get a lot of leverage out of existing models, and your only real leverage is how you prompt the model; luckily there's so much literature on it now that you can be very principled about it. You can't change the weights of the model and you can't fine-tune (some companies offer fine-tuning APIs, but you have no visibility into the model weights). You can change the prompt; you can change the samples you provide to the prompt in a few-shot way; you can change the schemas and descriptions of your tools; you can write evaluations. There's not much else you can do, but honestly, you don't need more. I'll show you examples where it will be clear that these frontier models are so good that if you just know how to instrument them and use them well, you can do an amazing job.

So let's talk a little about prompting; I'll come back to this topic later as well. At a high level, pre-2022, zero-shot prompting meant prompting the model with no examples: you just say, do this thing for me, like "translate this text from English to Portuguese." Few-shot means "translate this text from English to Portuguese; here are five examples of how to do it." That's five-shot. That was the early way to do this. Now we are in the reasoning era, and this is so new that people don't realize it: it's literally 12 to 14 months old, starting with OpenAI's o1 model.
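To make the zero-shot versus few-shot distinction above concrete, here is a minimal sketch; the task and the translation examples are made up for illustration:

```python
# Zero-shot: the instruction alone, no examples.
zero_shot = "Translate the following text from English to Portuguese:\n{text}"

# Few-shot: the same instruction plus worked examples before the query.
few_shot = (
    "Translate the following text from English to Portuguese.\n\n"
    "English: Good morning\nPortuguese: Bom dia\n\n"
    "English: Thank you\nPortuguese: Obrigado\n\n"
    "English: {text}\nPortuguese:"
)

# Either template is filled in and sent to the model as the prompt.
prompt = few_shot.format(text="Where is my credit card?")
```

With five worked examples instead of two, the second template would be the "five-shot" case the talk mentions.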
These models can now think. Chain-of-thought is an older paper, but the question became how to integrate it into the model itself. You have a reasoning phase: you give an input, the model reasons, and the more it reasons, the better it gets, which is another astounding thing. Most models today reason over your input, your prompt, and then they act.

One very important thing: models aren't interchangeable. If a prompt works for DeepSeek R1, it won't necessarily work for Gemini. We learned this the hard way. When we switched from GPT-4.1 to GPT-5, our agents suddenly just stopped working. GPT-5 is a much better model, but it assumes a lot less, and we hadn't realized that GPT-4.1's implicit assumptions were actually really helpful for us. Once they went away, we thought, oh no, we need to rewrite the prompt for GPT-5. So your prompts are very much tied to a model, a model version, a provider, and so on. Prompting is powerful; it's the only leverage you have if you're using a frontier model, or even if you're deploying an open-source model like Qwen. But prompts are brittle: even a few lines of change can lead to huge changes in output. I'll give you an example. We built an agent to help our customers, and there was a particular parameter in one of the tool calls that GPT-4.1 kept hallucinating; in 80% of the calls it hallucinated that input. We changed literally three words in the prompt and the problem went away, and I cannot explain why.
It's a lot of duct-taping, but unfortunately we had to do it, which also tells you that humans should not be writing prompts. Those of you who have used DSPy or other prompt-optimization tools know where this is going. It's a terrible idea, and I'm sorry, I don't mean to offend any product managers here, for product managers to write prompts. For engineers too, though engineers can at least run the models and quickly fix these things. If you're not close to the technology, you should not be writing prompts. What you should be writing is evaluations: your taste for what's good and bad, your taste in evaluation. That's where I think product managers are very, very helpful in this process. I'm not very useful on the taste side of things myself, because I'm not a product guy. But prompting these models is not something I personally believe should be democratized; it's really pointless, because prompts are so brittle. You should let the machines write the prompts for you: have another LLM write the prompt, and we'll talk about that.

Now, there was a question about fine-tuning. I have just one slide for it, even though it's my favorite topic and I wish I had more time. Very broadly, there are three ways to do it. The hottest topic, of course, is reinforcement learning: you use some reward to guide your model toward some output, but it's extremely expensive.
It takes a lot of GPUs and the signal is sparse, though you can actually do LoRA training with rank one and get pretty much lossless performance compared to full fine-tuning. I don't recommend it if you don't have a big GPU budget, and most companies don't. The second approach, my favorite because I've written papers in this area, is preference tuning. This is a lot like going from GPT to ChatGPT: for a given prompt you have one good answer and one bad answer, and you push the model away from the bad answers and toward the good ones. And the simplest thing is supervised fine-tuning, where you have access to high-quality data and you just fine-tune. Most people don't really want to fine-tune, but I'll give you this: you can use a really small model to achieve almost the same accuracy as frontier models, or even better, on specific use cases. But your model will forget other things; model capacity is a real constraint. If you're okay with the model solving a narrow use case, then by all means fine-tune, but don't expect it to still do well at tasks it wasn't fine-tuned for. It will drift away from the original model, and since it's small, it will forget a lot faster. Those are all real issues, but if you want to be narrow, you should absolutely give it a shot. My go-to tools are SFT and preference tuning. I mostly reserve reinforcement learning for use cases that are just very different. Coding is a great use case where RL has worked really well, and if you throw a stone in Silicon Valley you'll probably hit a person doing reinforcement learning on their laptop. But I do feel it's a little overblown; you don't need most of this. Try an API model first, and if it cannot solve your use case, then maybe try these tools.

Okay, we've covered a lot. Let me just check the time. Okay, I'm on time. Good.
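The preference-tuning idea sketched above can be written down concretely in the style of DPO (direct preference optimization), one common algorithm in this family; whether it matches the speaker's own algorithm is an assumption. Inputs are summed log-probabilities of a chosen and a rejected answer under the policy and under a frozen reference model:

```python
# Minimal DPO-style preference loss: push the policy toward the chosen
# answer and away from the rejected one, relative to a frozen reference
# model. `beta` controls how hard the push is. Illustrative only.
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers "chosen" than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy's preference is right.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy already prefers the chosen answer more than the reference
# does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)
```

In training, this scalar would be averaged over a batch of preference pairs and backpropagated through the policy's log-probabilities; here plain floats keep the math visible.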
So, what's the blueprint of an effective agent? We talked about evaluation. We talked about what leverage you have over models: you can either prompt them or fine-tune them. But what makes a good, effective agent? We talked about ReAct, right? Conceptually it makes sense, and ReAct is very simple: an LLM, a bunch of tools, keep going until you stop. But in reality there's a lot of other infrastructure that goes into it. Now, for every company the question is build versus buy. Do you build it? Do you buy it? I think it's quite an interesting question. You need memory. You need observability. You need somewhere to create evaluations. You need to measure latency. You need some kind of PII redaction or some kind of guardrails. So you need a lot of things to go right for you, infrastructure-wise, to build agents. What we do is optimize not just the prompts but everything around the prompts: you want rigorous evaluations, you want the APIs to be stable and documented and versioned, because you may change the prompt or the APIs, and you want your knowledge base to be good too, because if you have FAQs you want to make sure they are being surfaced to the users. But there's a lot happening here, and a lot of it can just be bought, honestly. There are so many companies today doing an amazing job, so if you're building agents and you don't want to build your own infrastructure, just buy it, just use other companies, and really focus on the prompts and the tools; everything else can be something you buy. For instance, if you want to use an SDK for building agents: LangChain, LangGraph, ADK, Agno, there are so many options, there are like a million of these things around. If you ask me which one is the best, I don't know. I don't think any one of them is the best, because this is changing so fast. Eventually we'll see what happens, but this space is so new that I cannot
say that one is better than the other. This is kind of how we set up our GenAI platform. At a high level, we have agents, actions and data. You just want your agent to be able to access data or take an action, and that's it. The right data and the right actions should give you enough leverage to build the right agents. Accessing data is a tool, or multiple tools; taking actions are also tools. So we have a bunch of real actions we can take: we can fetch information about customers, we can modify information about customers, we can change something about your credit card. Those are all real actions, and there are some guardrails around them. On the data side, we can give access to your information, we can fetch information from an encyclopedia, and we have some kind of a memory concept: the agent can remember within the session what is happening. In the middle is your thinking brain, your LLM gateway: either use OpenAI or some other AI model, or fine-tune and deploy your own. And then there's a bunch of authentication and prompt management stuff around it. So nothing revolutionary, but I like how we've set it up. It's quite modular, it's easy to reason about, and different teams can now focus on different parts of this: we have dedicated teams focusing on just the tools part, on the data part, and some teams focusing on just the prompting. So it creates this nice setup. This is an example of evaluations; it's the only slide I'm allowed to show. On the top right, it's a little small, but we have different colors for different variants and there are five different evaluations. These are very specific things we measure on one use case, and you can see that lower is better.
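The agent-actions-data setup described above is essentially the ReAct loop mentioned earlier: an LLM plus tools, iterating until it decides to stop. Here is a minimal sketch under that assumption; the stubbed model, the tool names, and the customer ID are all invented for illustration, not Nubank's actual stack.

```python
# Minimal ReAct-style loop: the model either requests a tool call or answers.
# The "model" here is a hard-coded stub standing in for an LLM gateway.

def fake_llm(messages):
    """Stub model: asks for the card status once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_card_status", "args": {"customer_id": "c-42"}}
    return {"answer": "Your card shipped and should arrive within 5 days."}

TOOLS = {
    # Data tools read state; action tools would mutate it (behind guardrails).
    "get_card_status": lambda customer_id: {"status": "shipped", "eta_days": 5},
}

def run_agent(user_message, max_steps=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):          # keep going until the model stops
        step = fake_llm(messages)
        if "answer" in step:            # model decided it is done
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])
        messages.append({"role": "tool", "content": result})
    return "Escalating to a human agent."  # fallback guardrail

print(run_agent("Where is my credit card?"))
```

Everything the talk adds around this loop (memory, observability, PII redaction, evals) wraps these same few lines.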
Some treatments are doing better than others on some evals, and you can see most of the treatments are doing better than control. These are the kind of numbers you can produce within a day. So if you have three different ideas, test them, and tomorrow I can tell you which one sucks. That, to me, is the best part, because I have a million ideas and most of them suck; I want to know which ones suck really, really fast. This is an example of our dashboard, where you can see traces of conversations and so on. A trace is just a word for the entire conversation that happens with one customer: it includes LLM messages, customer messages and tool calls. So you have all the information you need. Then there is the RAG part. I won't bore you with the details here, because it's all very traditional stuff. Essentially, it's nothing but search: you have a bunch of documents in your knowledge base, you store them, and if the user makes a query, you search against them. If the user says, "Oh, can I have a credit card for my 12-year-old?" then you need to know that your knowledge base has this information, fetch it and present it to the user in the right way. If the user suddenly asks, "Is Nubank operating in the US?" you need to know the answer to that question. So that's the search part. Then of course there's the PII redaction and prompt injection part. This is really important, and unfortunately I'm terrible at it; I don't know much about AI safety, but there are a lot of people who do. How do you redact information before you send it to the models? How do you ensure prompt injections are hard to do or don't happen? All of these are things our team is thinking about, and different people work on different parts of the stack. This part is also really important.
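The RAG step described above is, as the speaker says, just search over a knowledge base. A toy keyword-overlap retriever makes the shape concrete; real systems typically use embeddings or BM25, and the FAQ entries below are invented examples.

```python
# Toy retrieval: score knowledge-base entries by word overlap with the query.
# Production RAG would use embeddings or BM25, but the shape is the same:
# store documents, search with the user's question, surface the best hit.

KNOWLEDGE_BASE = [
    "Minors under 18 cannot hold their own credit card account.",
    "Nubank currently operates in Brazil, Mexico and Colombia.",
    "You can request a replacement card from the app settings.",
]

def retrieve(query, docs, k=1):
    # Crude stopword filter: ignore very short words like "a", "my", "for"
    q_words = {w for w in query.lower().split() if len(w) > 3}
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

print(retrieve("Can I get a credit card for my 12-year-old?", KNOWLEDGE_BASE))
```

The retrieved passage is then handed to the LLM so the answer is grounded in the FAQ rather than hallucinated.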
Unfortunately, I have only one slide on it, but this is something we're working on. If you want to launch a variant, how do you know it's actually working before launching it? That's so hard to do. So can you simulate stuff offline? Can you simulate a bunch of conversations with your new treatment and say, you know what, I think it's okay to launch? If it sucks, that's okay, but I don't want it to start leaking information, or start spewing hate at my customers, or just be useless. That's something we cannot tolerate. So we have a system internally where, once you create a variant, you have to do what we call bug bashing. Terrible name; I came up with it. You do a bug bash before you launch, and it used to take us weeks; now it takes a day or sometimes a couple of days. So we have this process: you take a couple of days to test things out, you launch within a day or two, and you know whether it's working or not. Mostly you kill it. Most things we kill, but some of them do really, really well, and those are the things that progress to the next gate. There are entire companies that do this today; I'm forgetting the names, but a lot of companies specialize in simulation. And I've seen companies partner with these simulation companies, because they can attack your agent in ways you cannot easily anticipate, and I think that's quite useful. Okay, we talked about the blueprint of an effective agent, where we discussed a lot of different things: the data, the actions, the brain, and so on. Now let's dive a little deeper into the prompting part, because I promised you I would go deeper than surface level about prompting. This is something I stole from a blog. I don't know whose blog it is, but I give them the source right here.
Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems: they also have to write the regular expression. Same thing for prompting. Some people, when confronted with a problem, think, "I know, I'll use prompting." Now they have two problems. As a human, it's just insane to have to write a five-page prompt. I cannot do it, and I feel none of us should be doing this. I'll skip this slide; I think it's just a repeat of "prompting is brittle." We treat prompts like code, and this is something many companies do, so I won't claim it's something only we do. But segment your prompt: break the prompt for the agent down into pieces and have semantic versioning for each piece. That's what we do, and we let the versioning evolve; one piece may move faster than another. Maybe tone and style is something you share across agents, so have one central place where tone and style is stored, and let more people, product managers or engineers, update that, while the tooling part moves more slowly. Those are different ways in which we segment our prompts. And now comes my favorite part: prompt optimization. How many people in this room have used DSPy, or heard of DSPy? Oh, not many. I think you're missing out. You need to go and check it out today, because this is going to change your life. DSPy is an amazing framework that started at Stanford, and the idea is: I am going to create a black box. I'm going to define inputs and outputs, and the black box is the prompt for my LLM. And I'm telling the LLM: hey, for this input I expect this output; now figure out what prompt can get me there. So you are flipping the problem on its head, right?
Because let's say you are doing, I don't know, maybe summarization, and you're telling this black box: hey, for these 10 different large documents, this is the right summary. Learn from this and figure out what the best prompt is, the one that gives you the highest accuracy possible. Now you can measure this: you can literally say that my accuracy went up from 80% to 90%. And once you see the prompt that comes out of this framework, you will never write prompts by hand ever again. DSPy has many optimizers; my favorite is an optimizer called GEPA. This is work from Berkeley; we're working with them on some projects. It's very sample-efficient: you don't need a lot of data, you need like a hundred samples, maybe even 50. Just sit down and label a bunch of stuff. It's not hard, and even reading the conversations helps you a lot. Then let GEPA, or DSPy, figure out the prompt. That's why I feel manual prompting, for a bunch of reasons, is just insane. But the most important part is model updates: you update the model and your prompt no longer works. This has happened to us a bunch of times. Quick time check: I have 10 more minutes. So, at a very high level, how does GEPA work? It analyzes a bunch of traces. You give it inputs and outputs, it reflects over the reasoning, it reasons over the output, and it keeps improving the prompt. Don't worry about the details; go read the paper, it's very, very interesting. Sorry, one minute to go back. Okay. All right. So the automated prompting part: we use this in an evals-first approach. What gets measured gets managed. This is something my father says, and I think he's right. You need a bunch of different ways to test how good your variants are. And what are the pillars of rigorous evaluation? Your evaluations should be binary first.
This is something I took from a course by, and I'm going to butcher their names, I think Shreya Shankar from UC Berkeley and Hamel Husain. They work closely with OpenAI and they have this evals course. It's $3,000, quite expensive, but you can take it; I think it's a very good course. So your evals should be binary. They should be repeatable, observable, and so on, all the good stuff. And how do you create a new eval? You first define a success signal. You collect test cases; that's where you read the conversations and actually label them with your evaluation. Then you create a heuristic judge, which is more deterministic, or an LLM-as-a-judge prompt. And then you have your eval. And how do you optimize the eval? This is where we connect prompt optimization to evals. All our evals today are optimized by GEPA, so I don't have to write any prompts for evals anymore; I have one less problem on my hands. What GEPA does for me: I give it a really crappy first prompt and say, you figure out how to improve it. The expected output, let's say it's a binary evaluation, is 0 or 1 for these 100 samples, and GEPA internally figures out the rest. This is a slide I took from some of our work where we used GEPA. Before GEPA, if we look at the correlation of different models on the output, it's so low; you can see that only the diagonal is close to one. But after GEPA, look at the difference. Huge. Now the same prompt works on all of these different models. Huge difference, right? This is the prompt. The starting prompt is really crappy; I wrote it, and it's very, very bad. It's hard to appreciate this in 30 seconds because you don't have the domain context, but GEPA did an amazing job of taking this and turning it into something actually useful. So for evaluations, everything we do now is optimized by LLMs, by AI. Okay.
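The eval-optimization loop described above can be approximated in a few lines: score each candidate judge prompt against binary-labeled samples and keep the winner. Real GEPA has an LLM reflect on failing traces to propose new candidates; here a keyword stub stands in for the judge LLM, and the conversations, labels, and candidate prompts are all invented for illustration.

```python
# Crude stand-in for GEPA-style eval optimization: measure each candidate
# judge prompt against binary-labeled samples (0/1) and keep the most
# accurate. A real optimizer would have an LLM reflect on failures and
# propose new prompts; here the "judge" is a hard-coded keyword stub.

LABELED = [  # (conversation, expected verdict: 1 = resolved, 0 = not)
    ("Agent: Your new card arrives Friday. Customer: great, thanks!", 1),
    ("Agent: I cannot help with that. Customer: this is useless.", 0),
    ("Agent: Card reissued. Customer: perfect, solved.", 1),
    ("Agent: Please wait. Customer: ok, I'll keep waiting.", 0),
]

def judge(prompt, conversation):
    """Stub LLM-as-judge: pretend stricter prompts require explicit thanks."""
    if "explicit confirmation" in prompt:
        return int(any(w in conversation.lower() for w in ("thanks", "solved")))
    return int("useless" not in conversation.lower())  # lax heuristic

def accuracy(prompt, samples):
    return sum(judge(prompt, c) == y for c, y in samples) / len(samples)

candidates = [
    "Did the agent resolve the issue?",
    "Mark 1 only if the customer gives explicit confirmation of resolution.",
]
best = max(candidates, key=lambda p: accuracy(p, LABELED))
print(best, accuracy(best, LABELED))
```

The point of the sketch is the workflow, not the stub: with ~50-100 labeled samples you can score judge prompts objectively and let an optimizer, rather than a human, do the rewriting.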
Since I have only three minutes left, I'm going to quickly run through a bunch of lessons from one use case. One use case we had is the card delivery agent: where is my credit card? I ordered a new card and I don't have it, or I ordered it two months ago, where is it? Or my dog ate it and I need a new one. Very, very important for our customers. So what's our goal here? We want to increase the self-service resolution rate, and we also want to increase NPS, the happiness factor, for our customers. First of all, our blended NPS is very high: we ensure that if AI is doing a terrible job, our Xpeers, who are amazing, will do the job of maintaining customer happiness. But as AI gets better, as our agents get better, we see a very large increase in TNPS, transactional NPS. And what did we use? Better operational procedures: the AI agent should use the same procedures as the human agents. It's a data problem: how do you take the data the human agents use and supply it to the AI agent? Evals and prompt optimization: another big lift in performance. And new tools for real actions: another big lift. So we are seeing a pretty large lift in customer happiness as we make our agents more automated and give them actual stuff to do, rather than being glorified chatbots. Lesson one: real data and real actions. Don't make a chatbot; work on tools. Before you build an agent, make sure you know what your agent is actually solving, and build the right tool. For instance, if you're doing credit card delivery, have a tool to order the credit card automatically. If you're not doing that, the agent is useless. Number two: prompts are model-dependent. I already mentioned this a bunch of times. We used GPT-4.1; the moment we moved to GPT-5, everything broke. So, new prompts for GPT-5. But now you have prompt optimization, so you can maybe use it to evolve the prompt. Finally, hallucinations are real. This is very important.
I struggle so much with this; it really pisses me off. But there's nothing you can do except prompt better, or, if you can detect the hallucination, catch it and fall back to a human. In our use case, if we figure out that a tool call failed, we just go back to a human, because we don't want the customer to be unhappy. And finally, effective tools are important. You need to move most of your business logic into tools. Don't let the agent figure out how to chain tools together: if three tools need to be chained together, just write one tool which composes them. I found this out the hard way. I was chaining a bunch of tools and they kept hallucinating, and I thought, I'm an idiot, I should have written one tool which calls the three tools internally. If it's deterministic, don't make the LLM do the work. This is really, really important. Okay. Nothing revolutionary here: data and action layers are the real moat. The frontier models are getting really good, so focus on data, focus on actions, use a frontier model, and I can guarantee you the first iteration of your agent will be a lot better than you expect. So, thank you. [applause] [music]
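The last lesson of the talk, composing a deterministic tool chain in code rather than letting the LLM order three tool calls itself, can be sketched as follows. The individual tools and their return values are invented examples, not Nubank's actual card APIs.

```python
# If the chain is deterministic, compose it in code instead of asking the
# LLM to call three tools in the right order (where every step can be
# hallucinated). The agent then sees one single, harder-to-misuse tool.

def cancel_old_card(customer_id):
    return {"cancelled": True}

def order_new_card(customer_id):
    return {"order_id": "ord-123"}

def schedule_delivery(order_id):
    return {"eta_days": 5}

def replace_card(customer_id):
    """The one composed tool exposed to the LLM; chaining is plain code."""
    cancel_old_card(customer_id)
    order = order_new_card(customer_id)
    delivery = schedule_delivery(order["order_id"])
    return {"order_id": order["order_id"], "eta_days": delivery["eta_days"]}

print(replace_card("c-42"))
```

Only `replace_card` goes into the agent's tool list; the three inner calls never appear in the prompt, so there is nothing for the model to mis-order.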
Video description
Scaling autonomous AI agents to serve over 127 million customers introduces significant technical and operational challenges. At QCon AI New York, Nubank shared a behind-the-scenes look at how production-grade AI agents are designed, evaluated, and continuously improved at scale. The session, presented by Aman Gupta, Principal Machine Learning Engineer on Nubank’s AI Core team, explores a pragmatic and data-driven approach to AI in production, including automated evaluation pipelines, LLM prompt engineering strategies, and the use of feedback loops to ensure reliability and measurable impact. For professionals working in data, data science, machine learning, and AI engineering, this talk offers concrete lessons from real systems operating at scale. Looking to work on AI systems with real-world impact? Explore opportunities at Nubank: sou.nu/jobs