The ThinkND Podcast

RISE AI, Part 7: Generative Computing

Think ND

Episode Topic: Generative Computing

Move beyond experimental chatbots and toward a robust IT stack capable of supporting autonomous agents. Sriram Raghavan, VP of AI Research at IBM, dismantles AI hype to reveal the “Generative Computing” paradigm. Learn why 95% of AI pilots fail and how a principles-based approach—leveraging small, fit-for-purpose models and rigorous governance—can transform enterprise uncertainty into secure, scalable, and efficient agentic applications.

Featured Speakers:

  • Sriram Raghavan, IBM

Read this episode's recap over on the University of Notre Dame's open online learning community platform, ThinkND: https://go.nd.edu/308d51.

This podcast is a part of the ThinkND Series titled RISE AI. 

Thanks for listening! The ThinkND Podcast is brought to you by ThinkND, the University of Notre Dame's online learning community. We connect you with videos, podcasts, articles, courses, and other resources to inspire minds and spark conversations on topics that matter to you — everything from faith and politics, to science, technology, and your career.

  • Learn more about ThinkND and register for upcoming live events at think.nd.edu.
  • Join our LinkedIn community for updates, episode clips, and more.

Welcome

Speaker

All right, in the interest of time... I always say that, and then I'm the one who introduces delays. But now, in the interest of time, let's just get started. We had an amazing couple of sessions this morning, including a delightful conversation with Father Benanti and Aarti, as well as panel discussions on AI ethics and the role of AI in city and local governance. You heard from leaders in the county, the city, and mental health facilities in town. As we've been thinking about how to look at responsible, inclusive, safe, and ethical AI, which is the theme of the conference, I'm deeply excited about the talk that Sriram is gonna offer us, which is on building robust, secure, and efficient generative AI applications.
Before I introduce Sriram: it was back in December that Sriram and I first spoke. He was in India at that time, I think I was in Vietnam, and Jim Cavanaugh from IBM was in New York. So there were three of us in three different places in the world. I'll let him talk a bit more, but when he explained the ideas, and they were fresh, I believe they had just been talked about in a board meeting or something at that time, he was gracious enough to share with me what generative computing was. I saw that as where the future is headed, right? In terms of how the computing stack and technology are gonna be, and we'll learn more about it from him now.
To introduce him formally as well: Sriram Raghavan is the Vice President of AI Research at IBM Research. So essentially, across all the research labs, anything to do with AI reports up directly to Sriram, whether it's Zurich, Israel, Ireland, or India, including the two big labs in India. In fact, he helped set up the big IBM research lab in India. He leads a worldwide team of scientists, and he is responsible not only for doing the innovation as part of the research, but for transitioning that cutting-edge work into a twenty-five billion dollar software business at IBM. Twenty-five billion dollars. So essentially the research, innovations, and discoveries that are happening in research. As academics, we know it's incredibly hard to do that last-mile challenge. So kudos to you for really taking that journey.
As we were talking about earlier, he got his PhD at Stanford, and his first job after his PhD was at IBM. Now he has spent twenty-five years there. As I said already, he was the Director of the IBM Research lab in India and CTO for IBM India/South Asia, and his expertise spans database systems, which was his core PhD research, and now NLP and distributed systems. Truly a pioneer. And if I may evoke a conversation you and I had yesterday, a deep conversation about what it means to be a computer scientist: for twenty-five years we have literally been peers, and we were studying the same things in computer science PhD programs when we were students. We have to think about what computer science means in this paradigm, and we feel it is still very true to what we believe we were taught and what we believe it should be. So we're gonna write that together, Sriram. We should. So again, please join me in welcoming Sriram to share his ideas and vision about what generative computing is. Thank you, Sriram.

Speaker 2

Thank you, sir.

Speaker 3

Thank you. Thank you. Good afternoon, everyone. My pleasure to be here. Thank you, Nitesh, for the warm introduction. I thought I would build up to generative computing a little bit by first talking about why we are thinking of this as a new computing paradigm, 'cause sometimes it's hard to cut through all the noise. There's an obsession just about models and model sizes for a while. Then there is an obsession about infrastructure and spend, and that's what you hear in the news. But at its heart, we firmly believe that what we are witnessing is a new kind of computing element entering our computing stack, if you will. That's gonna let us build new kinds of applications. We are already building new kinds of applications, but we have to understand them formally and better in order to build them safely, securely, and efficiently.
I'm sure you're all seeing a lot of the news. There was this MIT article that said ninety-five percent of AI pilots fail, that nobody's doing anything useful with it. There was a Forbes article. These get a lot of airtime, and certainly there is a gap between the promise of AI and driving that value in enterprises. It's because we're seeing this transition happen so fast. Before companies could grapple with what to do with an AI that can generate for you, we went from that to assistants and now to agents that are supposed to do work for you. And it's happening in a matter of a couple of years. It's a little bit hard for enterprises to grapple with how to set up the systems, the platform, and the development practices to truly harness the art of the possible and drive value, right?
So I'm gonna wear the lens of what I'm gonna call, for lack of a better word, enterprise AI, and build up to: what are we doing in the enterprise today to help govern AI? Why are fit-for-purpose, efficient approaches really, really important to deliver value? And then I'll get to generative computing.

Enterprise AI Stack

Speaker 3

But to contextualize this, we have a very simple point of view at IBM on what we believe it takes to really get value from enterprise AI. The first is that we truly believe the answer isn't ever going to be this one magic model that rules the world. It never is. The world is too diverse. You have lots of needs. You have lots of things you need to get done. We were saying it for a while, and now I think there is no debate on point number one: any non-trivial enterprise is gonna need multiple trusted, fit-for-purpose models of different sizes and complexity to really move the needle.
The second is that how you leverage enterprise data is gonna be super, super critical. It doesn't mean that you have to stick all that data into a model, but yesterday we heard a great talk about how complicated the notionally simple idea of RAG is to get right. RAG is one way to make enterprise data available to the model. There are lots of techniques needed. But as an enterprise, if eventually everybody has access to the same collection of models, how are you gonna differentiate unless you can bring your data to bear in unique ways? So that's bullet number two.
Third, you're going to have to build agents powered by these models. We'll spend a lot of time on agents. You are going to have to take a platform approach. The mistake of thinking of AI as a collection of individual use cases is what will cause people not to get value. You have to recognize that eventually you have to retool your software stack, your IT stack inside a company, and think of AI not as some collection of models, but really as a platform on which you're going to build lots and lots of applications. And then obviously, consistent with the theme of this conference, a focus on governance is critical. There's almost no meaningful industry, certainly not a regulated industry, that can deploy AI without an end-to-end governance strategy.
So this is a ridiculously simple approximation, but the reason this picture is useful is that, as we think about what technology we bring to clients, we have this very simple, consistent picture that we use. Now, I'm not populating it with IBM products. My job here is not to sell IBM products. But this is a simple view of all of the layers of an IT stack that enterprises today have to think about when they think about AI. You have to think from the ground up: from infrastructure, to your horizontal platform, to what it means for data, to the emerging agent platform, and to the applications and domains. Depending on who you are as a company, you may get the bottom two layers from one or more cloud players. You may have those layers be offered by one cloud. You may always have infrastructure of your own. But you'll have to think about implications across all of these different layers, with governance as a cross-cutting topic.
Now, there's a lot I could say here, but I decided I would focus today on three things. First, how have we seen AI governance evolve? What has been the journey of evolving governance from classical machine learning to today's agentic governance? I'll give a little flavor of some of the technologies we are building and how that gets packaged into a product. Second, I want to talk a little bit about models. I'm very pleased with my team. On October 2nd, which was just recently, we released in the open a set of really small, tiny, fit-for-purpose models, and I wanted to give you a flavor of how much the technology has progressed and what small models can do. And the third is to build on those two to introduce the idea of generative computing. Throughout, you will see paper links and so on. A lot of what we do is in the open, so if you're seeing GitHub links, we encourage all of you to go.
Please join those GitHubs, break our software, play with it, use it, leverage it. You will see that throughout the talk. So that's my plan. Three parts. By the way, I've never done these three parts exactly together, so if I'm way out of time, I will skip a piece and go to the other, but I will certainly touch on generative computing, 'cause that's in the title.
All right, governance. What this picture is trying to say is that, as AI has evolved, you have to think of governance as adding layers and layers of complexity. So we were doing traditional machine learning. It wasn't as if AI governance wasn't important in traditional machine learning. Model risk management in financial services is a very mature discipline. There you worried about distributional fairness, you worried about adversarial robustness, you worried about explainability, you worried about fairness. Then you added the concentric circle of generative AI, and you had to add additional concerns. You had to worry about hallucination. You had to worry about faithfulness. In classical machine learning, you built a model on your data. Now, in generative AI, you are getting a model from somebody, and God knows what they trained it on. So data and source attribution became important. Then you added another layer of agentic AI, and now you have to worry about this: my AI is not just giving me answers, it's calling tools. Is it calling the right tools? Is it authorized to call the tools? How do I deal with human action versus agent action?
So the point of this chart is that, as we have evolved AI, the set of governance considerations has evolved, but they are additive. It's not as if in agentic AI, when you worry about tool-calling faithfulness, you can ignore fairness. You have to build up these governance considerations in layers. Every evolution of AI has added additional things that, quote-unquote, "we have to worry about."
Now, what I wanted to do was give you a little flavor. It's going to be a little bit like a tour of technologies that we have been building for gen AI governance and how we are evolving them for agentic AI governance. The point of this picture is that governance is not one thing. There is a set of questions you have to answer. If you take this simple picture of an LLM and a user, an application, or an agent interacting with the LLM, when you think about governance you have to ask a set of questions. You first have to ask, at a use case level, what are the risks? You have to ask, what can I do before deployment? What do I have to do post-deployment? What does explainability mean? How can I make sure that if I want certain behavior, I can steer the LLM in that way? And then how do I constantly do risk management? What you see on the far right is just a collection of technologies that we have been building in IBM Research, some of them jointly with the Notre Dame lab, to address these pieces. So this is not a product. These are technologies that go into a product. But the point of this chart is that just defining the set of things you have to do in governance is a pretty challenging task, and you have to look across the life cycle.
So I'm gonna pick a few of these examples and give you a flavor of the technology. The first one, on the top, is an innocent question, but actually the one that our clients most get stuck on, which is: what are the risks I should actually even worry about? Help me navigate. Everybody's telling me about risks. You could give me a huge taxonomy. NIST has a taxonomy. IBM has a risk atlas. Everybody has a taxonomy. Help me understand, for a particular use case, what risks should I even be guarding against?
'Cause certainly the set of risks I worry about if I'm building a recommendation engine to recommend suits for Nitesh is different from the set of risks if I'm doing medical diagnosis. They're very, very different in nature and quality. So one of the first things we did is this thing called a risk atlas. This is a public document. It actually started when IBM established an ethics board inside the company. I think we were one of the first ones to do that, way back in August 2023. And we recognized that even within the business, there was a question of how do I think about the risk? What is the taxonomy? What are the risks on the input? What are the risks on the output of the model? So we've been evolving this risk atlas, and this is now a living document out there. What we did not anticipate, and what has happened, is that the risk atlas actually has a one-to-one mapping to our product documentation. So when somebody buys a product called watsonx.governance from IBM and deploys a solution, they can go understand the risks, map them to the taxonomy, and say, "For these risks, these are the mitigation measures that I should be employing." So the atlas has been very, very powerful, and we have been continuing to update it. I told you about this concentric circle where you have to add more and more considerations. We have been updating it to now deal with the world of agentic risk as well. So this is a resource. I don't want to spend time on it, but I wanted to give you a flavor that this is a really important resource.
And then, working with our ethics lab here, the team built this thing called Risk Atlas Nexus. The goal there was: okay, great, you have a taxonomy. How do I navigate the taxonomy? So can I come in and describe... And you can go here, you can actually go play with this.
Can I describe the intent of an AI application and have it help me navigate the taxonomy and tell me which risks are applicable and what mitigation measures I should use? So give me a way to slice and dice that taxonomy, given some input about what application I'm looking to build. In fact, just like you have this way of going from an intent to risks, we have actually built an AI-based navigation advisor for the taxonomy, where you start from the intent of the application you're looking to build. We have a way to get questionnaires, and today what we are doing is evolving the questionnaires to meet regulations. So suppose you're telling me that you're gonna deploy a medical application in the EU, or a financial application where, for example, the New York State housing law might become applicable. Between the questionnaire and the intent, we guide the user through what metrics you have to observe, what data you have to gather to meet regulatory requirements, what instrumentation you have to do, all the way to what runtime monitoring you have to do.
So the takeaway is that, in governance, one of the first things we had to address was this taxonomy problem: help people navigate the taxonomy, map from use cases to the taxonomy, and then help them understand how to set up the system. Today, one of the things we do based on this picture is provide our customers something called regulatory packs. You can literally click and buy a regulatory pack, and download the package that tells you how to deal with a particular law in a particular geography. It'll tell you, "Hey, if you are deciding this according to the law, here is what you must gather. Here is the evidence you must produce. Here is the frequency of your monitoring." This is very similar to what happens in IT systems. If you want to certify your IT systems, there is a set of policies and controls that you have to obey. This is that in the AI world. So this was one of the first things that we had to do.
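
As a rough illustration of the intent-to-risk navigation described in this part of the talk, here is a minimal Python sketch. Everything in it is a hypothetical placeholder: the three taxonomy entries, the keyword sets, and the mitigation names are invented for illustration and are not taken from IBM's risk atlas or from Risk Atlas Nexus, and plain keyword matching stands in for the AI-based advisor and questionnaires the speaker describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    keywords: frozenset   # hypothetical trigger words in a use-case description
    mitigations: tuple    # hypothetical mitigation measures for this risk


# Toy taxonomy, invented for this sketch only.
TAXONOMY = (
    Risk("hallucination",
         frozenset({"answer", "chatbot", "summarize"}),
         ("groundedness check",)),
    Risk("unfair outcome",
         frozenset({"loan", "hiring", "housing"}),
         ("fairness metrics", "periodic audit")),
    Risk("harmful medical advice",
         frozenset({"diagnosis", "medical"}),
         ("human review", "output guardrail")),
)


def applicable_risks(intent: str) -> list:
    """Return taxonomy entries whose keywords appear in the stated intent."""
    words = set(intent.lower().split())  # naive tokenization, no punctuation handling
    return [risk for risk in TAXONOMY if risk.keywords & words]


def mitigation_plan(intent: str) -> dict:
    """Map each applicable risk name to its list of mitigation measures."""
    return {risk.name: list(risk.mitigations) for risk in applicable_risks(intent)}


if __name__ == "__main__":
    print(mitigation_plan("chatbot for medical diagnosis support"))
    print(mitigation_plan("recommend suits"))
```

The point of the sketch is only the shape of the mapping, use case in, applicable risks and mitigations out, which is why a suit recommender and a medical assistant end up with different plans.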

Granite Guardian Guardrails

Speaker 3

Now, the second thing I'm gonna spend a little bit of time on is this whole idea of guardrail models, right? We all recognize that between LLMs and applications, there's gonna be a need for guardrails: guardrails that check the input going into the language model and guardrails on what comes out of the model. So one of the families of models we've been building is called Granite Guardian. Granite is the brand name we use for all the models that we build, and Granite Guardian are the guardrail models. Just to give you a little bit of that journey: today we have released publicly three Granite Guardian models of different sizes. They deal with guardrails for a lot of what you see on the bottom left. We built these models by specifically distilling and tuning them from a general-purpose LLM, but they are focused exclusively on guardrails. You can use these models both before and after an LLM, and there is no requirement on what the actual LLM is. The LLM in the middle could be Llama, it could be any other open source LLM, it could be one you built on your own, but you can use the Guardian on either side: check the prompt before it goes into the LLM, and then check what comes out of the LLM in response, for all of these things. They are released under the Apache 2 license.
And what has been very pleasing is how well such small models do. Look, the one out there is a three billion parameter model which only has eight hundred million activated parameters. You can run it on hardware literally on each of your desktops, and pretty soon on even smaller hardware. But we have seen some amazing results. We didn't even benchmark this ourselves. The EU has a benchmark called GuardBench, and you can see that on Hugging Face over there. It's third-party benchmarking. Our Granite Guardian models are in the top four ranks across a lot of these.
So this is a set of benchmarks that was created, I believe, by a university in Italy, consistent with things that I think the EU wanted to measure. It's a very diverse set of benchmarks, and we are in the top ranks. This was very, very impressive, and interesting for us, because we hadn't even known about the existence of the EU GuardBench. We just built a general-purpose guardrail model. It looked like it did a pretty good job, and it is multilingual in nature, because GuardBench is actually multilingual, at least for European languages.
The other example of where the Guardian was evaluated is another piece of work called GuardSetX, which went after domain-specific guardrails. The previous one is just general-purpose guardrails, like harm and those kinds of things. This one goes after specific domains: HR, finance, law, education, et cetera. This was published, I believe, last year, and it is one of the most extensive benchmarks we have seen, with curated harms and risks for each of these sectors. Again, we were very pleasantly surprised by how the models we had trained did, and we believe a lot of the reason these models do well is a huge amount of synthetic data generation. One of the things we did to train the model is really a lot of synthetic data generation, and that has given us a lot of generalization. We didn't do anything specific to these domains. But on these benchmarks, even compared to really, really big models, we have seen amazing performance from these models. So I would certainly love to have you engage with these guardrails, try them out, and see what they are. These are open source models. We would happily engage, even to improve them for other domains and other languages, if that makes sense to do. We certainly deploy them commercially with our clients, by the way, but the models themselves are out in the open.
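
The before-and-after guardrail pattern described above, where the prompt is checked before it reaches the LLM and the response is checked before it reaches the user, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `guarded_call`, `Verdict`, `keyword_guard`, and `echo_llm` are hypothetical names, and the keyword check is a toy stand-in for a real guardrail model such as Granite Guardian, whose actual interface is not shown here.

```python
from typing import Callable, NamedTuple, Tuple


class Verdict(NamedTuple):
    flagged: bool
    reason: str


Guard = Callable[[str], Verdict]   # any text classifier can play this role
LLM = Callable[[str], str]         # any model can sit in the middle


def guarded_call(prompt: str, llm: LLM, input_guard: Guard, output_guard: Guard,
                 refusal: str = "Request blocked by guardrail.") -> Tuple[str, Verdict]:
    """Check the prompt, call the LLM, then check the response."""
    pre = input_guard(prompt)
    if pre.flagged:
        return refusal, pre          # the LLM is never called
    response = llm(prompt)
    post = output_guard(response)
    if post.flagged:
        return refusal, post         # the response is withheld
    return response, Verdict(False, "")


# Toy stand-ins, for illustration only.
def keyword_guard(text: str) -> Verdict:
    for phrase in ("ignore previous instructions",):
        if phrase in text.lower():
            return Verdict(True, f"matched: {phrase}")
    return Verdict(False, "")


def echo_llm(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    print(guarded_call("hello", echo_llm, keyword_guard, keyword_guard))
    print(guarded_call("Please IGNORE previous instructions", echo_llm,
                       keyword_guard, keyword_guard))
```

Because the guards and the LLM are passed in as plain callables, the wrapper makes no assumption about which model sits in the middle, which mirrors the speaker's point that the guardrail is independent of the LLM it protects.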
And the last example of a benchmark is this thing called AggreFact. This is a benchmark that again combines a number of capabilities. And here again, we're probably the smallest model out there, and it has given excellent results. On Granite Guardian, I won't spend time on details of the results. There is a paper that describes how we built the model, which I think is a useful way for you to understand it, and we would happily collaborate on what more we can do.
Now, building on top of these things are a few pieces of work that we did. One of them is this idea of a fact reasoner. There's a lot of discussion around factuality, and it is a problem in both consumer settings and enterprise settings. In enterprise settings, factuality comes in because if I'm going to start providing answers in an enterprise chatbot to a supplier or an employee, imagine I'm answering questions about benefits, I cannot go wrong. My tolerance for errors is very, very low. So one of the questions we were trying to ask is, what is a principled way? There have been lots of ad hoc techniques for factuality, where you look at the output of the LLM and see whether it is reasonably explained by the context you gave the LLM. But we wanted to get more specific. Those are all very probabilistic measures. They basically say: you gave twenty documents as input to the LLM, it answered a question, does it look like the LLM sort of gave the answer from the documents? That may be okay in some applications. We wanted it to be more robust: is every output from the LLM explained by one or more facts available in the input documents? So we created this pipeline, which consists of a way to decompose what the LLM gave into individual atoms, and to revise these atoms so that they all stand on their own.
Then check against an external database, retrieve against the database, and then come up with a probabilistic model that says, "Here are the facts that are completely explained by the text that we see. Here are the facts that seem partially explained. Here are the facts for which we can't find any evidence, at least as far as we can tell," and surface that. This is, by the way, also available in the public domain. There is a GitHub here for this FactReasoner project.
An example of a pipeline built like this for demonstration purposes is something like this, and we have it running as a test bed. Obviously, when we are running internally, we are not going to do public web search. But imagine the LLM gave you a response like, Lanny Flaherty was born in so and so, he has appeared in numerous films, et cetera, et cetera. The atomizer will first take that and break it up. The recontextualizer will turn it into individual facts. In this case, getting individual facts requires you to do entity resolution, so that every reference resolves to a specific entity. It will then use Google search to retrieve external evidence, and we restricted Google search to only Wikipedia, to say, "Hey, is whatever you are saying supported by Wikipedia?" Then it'll actually build a graph of interconnected facts, because if a fact is wrong, a dependent fact is also probably wrong in the output of the LLM, since it's drawing that conclusion, and then it will give you a factuality score.
We think the way we are going to have to go forward is that instances of these pipelines have to be built for very different, specific settings. So this is a template for building a FactReasoner pipeline. You may replace the Google search API with an internal enterprise search API. You may replace it with whatever the reference document set is in your domain, and find ways to do factuality.
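
The stages named above (decompose into atoms, revise them to stand alone, compare against retrieved evidence, and sort each atom into fully explained, partially explained, or no evidence) can be sketched as follows. This is a minimal sketch, not the FactReasoner implementation: sentence splitting stands in for the atomizer, a pronoun swap stands in for entity resolution, token overlap stands in for the probabilistic model, the passages are supplied directly instead of being retrieved by search, and the thresholds 0.8 and 0.4 are arbitrary assumptions.

```python
import re


def atomize(response: str) -> list:
    """Naive atomizer: treat each sentence as one atomic claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def revise(atoms: list, subject: str) -> list:
    """Naive reviser: replace a leading pronoun so each atom stands on its own."""
    return [re.sub(r"^(He|She|It|They)\b", subject, atom) for atom in atoms]


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support(atom: str, passages: list) -> float:
    """Fraction of the atom's tokens found in the best-matching passage."""
    atom_tokens = tokens(atom)
    if not atom_tokens:
        return 0.0
    return max((len(atom_tokens & tokens(p)) / len(atom_tokens) for p in passages),
               default=0.0)


def label(score: float, high: float = 0.8, low: float = 0.4) -> str:
    if score >= high:
        return "supported"
    if score >= low:
        return "partially supported"
    return "no evidence found"


def check(response: str, subject: str, passages: list) -> list:
    """Run the toy pipeline and return (atom, label) pairs."""
    return [(atom, label(support(atom, passages)))
            for atom in revise(atomize(response), subject)]


if __name__ == "__main__":
    evidence = ["Ada Lovelace was born in London in 1815."]
    answer = "Ada Lovelace was born in London. She won an Olympic medal."
    for atom, verdict in check(answer, "Ada Lovelace", evidence):
        print(f"{verdict}: {atom}")
```

Swapping the `passages` argument for the output of an enterprise search call is where a real deployment would differ, in line with the speaker's point that the retrieval source should be replaceable per domain.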
So again, the previous guardrails were trying to detect things like hallucination. This one is a little bit more. This is not that the model is hallucinating, but that it can draw incorrect facts from existing documents. How do we verify and measure that?
Quickly, two other things we are working on, which are much more in the labs and not yet mature enough to be in products. The first is this idea of in-context explainability. Explainability has been around for a long time. It was there even in traditional ML. In traditional ML, you worried about explainability, but usually by trying to understand the features of the input data that could explain the output. In the textual world, a lot of in-context explainability is really about how accurate our citations can be. So we've released a toolkit out there, which is a collection of algorithms to do in-context explainability. We really don't think there is going to be one magic bullet for this. Different domains, different styles of writing, and different types of documents are gonna require different techniques. So we want this to be a place where we can collect algorithmic techniques for in-context explainability. We have released a set of methods, and we would certainly love to collaborate with others on how we can extend this. We really see this as a family of explainability algorithms. This is what we saw in classical machine learning, and this is what we see happening in generative AI as well.
The last piece of work I wanted to introduce in this space, which is very much still in the labs, is this notion of steering. Steering is: can I steer the output of the model in ways that I would like, where I may not always be the creator of the base model? We think that there are at least four places where you can do steering. So you sort of see that pipeline, right?
You can steer by saying, I can take input prompts and learn an adapter. Can I just rewrite the prompt so that I get the kind of answer I want from the LLM? I don't touch the model. That's one kind of steering. The second kind of steering says, "Hey, what if I can actually mess around with the model?" And messing around with the model doesn't mean you have to retrain. You can train LoRA adapters, you can train modular ways to mess around. What are the techniques that allow me to mess around with the model and then change its behavior? That's the second stage, where you see that P theta prime of X: can I do structural control? The third is a little bit more invasive, which is, can I mess around right inside the network itself? A LoRA adapter operates on top of the base model. Can I mess around with the internals of the model itself? That is harder to do, but if you have access to an open-weight model, you can do that. And finally, what can you do at the output?
The reason we think steerability is going to be really important is that the following has become very clear. While I said multi-model, enterprises shy away from deploying hundreds of models. They want to land on a model family that they approve, they certify, and they have deployed in their environment. Now, that model is not necessarily going to behave exactly the way you want for every use case. The model is given to you, almost like a constraint. So steering is: how do I have a robust set of techniques that allow me to adjust the model in principled ways, and can I mix and match techniques? Again, the reason we are creating a toolkit is that we don't believe there is going to be one AI steering algorithm to rule the world. You're going to have to mix and match techniques for different scenarios. So this toolkit is about how we invent these things. So what I did was give you a landscape view. All of them are building blocks.
Many of them are open source projects and models. They come together and address different parts of generative AI.

Agentic Governance Risks

Speaker 3

Now, what do we have to add to this for agentic AI? Like I told you with that concentric circle, when you are building an agentic system, you're gonna have to govern the individual model and also do governance at the agent level. At the agent level, your problems become a little bit more complicated. Look at what's new entering the beast: my previous picture was just an LLM and a user application. Now I have an LLM, I have an agentic framework, I have tools, I have memory. So now you have to figure out what the new governance considerations are.
One example of a new consideration is tools. What are the new challenges with tool calling? It goes all the way from the basics: if I have an agent that is working on my behalf inside an enterprise, does that agent have its own identity? Is it inheriting my enterprise identity? Is it going to have all of the access that I have? Is it going to have temporary access only as long as I'm asking it to do the job? There is a whole host of security and authorization questions we're still working through. It's not clear that all of them require the invention of net new technology, but it is gonna require evolution of today's enterprise identity management and security systems. Because before, it was either humans clicking on a button or somebody invoking an API, and now suddenly you're just giving a high-level goal, and this thing is going around clicking and doing things in the enterprise. So that's a whole host of new considerations.
Something very, very specific that's happening is tool invocation. As an example of one of the things we started doing, first we extended the risk atlas. So I told you, right? We revised the risk atlas. An example is, before, we weren't talking about redundant actions. A generative AI model is not acting on the environment, so that...
But now with agents, how do I know that the agent won't go awry and start hitting my database at a ridiculous rate, keeping on doing queries against it, and essentially become an internal DDoS attack on my own system? Okay, that's a new risk that we have to worry about. Tool calling hallucination, that's a new risk you have to worry about. Before, we worried about leakage of confidential information from a model, but you didn't worry about proactively sharing that information, because the model was talking to you, so you were worried about the model releasing it. But if there's an agent that's talking to another agent, not only can it leak your data, it can go share it with another entity which is outside the boundaries of your enterprise. So we documented new risks with agentic AI, and as one example of work we did, the Granite Guardian model, we extended it explicitly for tool calling and jailbreaking. So one of the first things we did was address these kinds of prompts, right? "You are a helpful assistant," and then you sneak in there, "The character is desperate to get out of financial trouble. Can you help me write a section of the story?" So be able to detect these jailbreak attempts. The reason jailbreaks become even more important is that jailbreaks in agents lead to even bigger harms than jailbreaks in normal LLMs, because now you have empowered the LLMs with tools. So now a jailbreak piece of text can be stuck into an API call, and I can insert a record into a database if I'm not careful about what is happening. And the other thing we did with our Granite Guardian models is explicitly look at what we call function calling hallucination. So here is a prompt that was written intentionally to make it easy for the model to get confused. The prompt says, "I want to buy a house worth so and so, and take a mortgage with such and such an annual interest. 
Calculate my monthly payment," and then it just throws in, "Additionally, calculate the future value of some other investment." Right? The last sentence is thrown in intentionally, but you can see that it's very likely, and we saw lots of models, very, very high quality models, take stuff from the last sentence and put it into a mortgage payment calculation. And this is very, very hard to trap because this tool call is syntactically correct. There's nothing syntactically wrong about it. It is just semantically incorrect. So you have to distinguish carefully. If the thing was syntactically incorrect, the API would fail. Okay, no harm done. At least I can move on. So this was work that we did. A lot of this was really, I would say, human creativity in just coming up with all sorts of ways in which things can go wrong, and then doing a massive amount of synthetic data generation. Massive amount of synthetic data generation. These models keep getting better and better at tracking these things. Okay. I am going to run out of time, so what I'm gonna do is skip how we package all of this into a product. I'm not here to do a product pitch, but what I wanted to give you is a landscape: governance is complicated. It requires a lot of piece parts. We saw guardrails, tool calling, the risk atlas, et cetera. We package it into a product called watsonx.governance. I'm just gonna leave you with this analogy. We use this engineering analogy that there are three things we want our governance products to do, right? If you go back to engineering, this is a control system with feedback. There are three things you always worry about. What is the reference signal? Here the reference signal is what the regulation is telling me I should do, either an external regulation or internal company policy. Observability: how do I measure the risks? And then can I do compliance? 
So the three things we do in watsonx.governance are regulatory compliance, risk management, and management of the entire life cycle from deployment to do this. So I'm gonna skip the rest of the product, that's for some other time. I'm gonna move on. I want to now talk about models. Oh, maybe I'll just do one pitch, since I spoke to Nitesh about this. I really believe that there is huge impact on how we develop software. We've been developing software with the software development lifecycle, agile, DevOps, testing, et cetera. Development of agents built on top of LLMs is gonna require a new muscle. This loop is our first cut at what we think this ADLC looks like. We actually released this, I think, two days back. This was jointly authored with Anthropic. We released it in the public domain. It's a point of view on what the development lifecycle of agents looks like. And I think this is fundamentally new, because of everything from security testing to observability. Today you test deterministic things. Now you're testing something stochastic. Evaluations become bread and butter. Your average software engineer is not taught how to do evaluation. So one of the biggest challenges we see actually in the market is people deploy agents. What is the methodology to maintain and test agents? People know how to do performance testing. People know how to do pen testing. Evaluation is a data science practice, which isn't always taught to the average software developer. So SDLC is getting turned into ADLC, and I think this is a rich area both for fundamental innovation as well as for an educational institution in terms of, you know, training people. So I'll leave it at that. I think this one has a lot more to unpack. So I want to get to models.
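The redundant-action risk raised in this section (an agent hammering a database until it becomes an internal DDoS) can be illustrated with a small guard placed between the agent and its tools. This is only a sketch under stated assumptions: `ToolCallGuard`, `ToolCallBudgetExceeded`, and `count_open_tickets` are hypothetical names invented for illustration, not part of watsonx.governance or Granite Guardian, and the cache is only appropriate for read-only tools where a repeated answer is acceptable.

```python
import time
from collections import deque


class ToolCallBudgetExceeded(RuntimeError):
    """Raised when the agent asks for more tool calls than the budget allows."""


class ToolCallGuard:
    """Hypothetical guard that sits between an agent and its read-only tools."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._recent = deque()  # timestamps of calls that reached a tool
        self._cache = {}        # (tool name, sorted args) -> earlier result

    def call(self, tool, **kwargs):
        # Redundant action: an identical call was already answered, so reuse it.
        # Argument values must be hashable for this key to work.
        key = (tool.__name__, tuple(sorted(kwargs.items())))
        if key in self._cache:
            return self._cache[key]

        # Drop timestamps that have fallen out of the sliding window.
        now = self.clock()
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()

        if len(self._recent) >= self.max_calls:
            raise ToolCallBudgetExceeded(
                f"budget of {self.max_calls} calls per {self.window_seconds}s exhausted"
            )

        self._recent.append(now)
        result = tool(**kwargs)
        self._cache[key] = result
        return result


def count_open_tickets(customer_id):
    """Stand-in for a real database query."""
    return {"acme": 3}.get(customer_id, 0)


guard = ToolCallGuard(max_calls=2, window_seconds=60)
print(guard.call(count_open_tickets, customer_id="acme"))
print(guard.call(count_open_tickets, customer_id="acme"))  # served from the cache
```

A guard like this only catches the volume side of the problem. The semantically wrong but syntactically valid tool call described above still needs a separate check, which is the role the talk assigns to a detector model.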

Small Models Strategy

Speaker 3

So in the models world, just a simple categorization. We've seen three classes of models emerge. The frontier models, clearly, are known by name. They're the biggest, largest, baddest models, typically only accessed through APIs. There's a set of models, open source, which are larger, and I'm gonna define them as stuff that typically requires multiple GPUs to run effectively. And then there is an emerging family of small models. This is the focus of what we do, because my intuition is the middle bucket is gonna go away. The small models are becoming so much more powerful that eventually you're going to build agents with a big model that knows how to plan and orchestrate, and a collection of small models that do individual tasks really, really well. This middle bucket, which is the models that are two hundred billion parameters, a hundred billion parameters, and require four GPUs to run, is going to get squeezed from both sides, between the real frontier models and the small models. So today we announced a partnership with Anthropic on the first one, and then we are building our own family of models on the last one. And just to give you a flavor of the latest family of models, a couple of statements. There is enough validation now that when the use case is specific enough, small models hunt. So Salesforce benchmarked. They have a CRM benchmark, and they released a bunch of benchmarks. And if you have a model that is within 0.02 of another model, but it is one fiftieth the cost, an enterprise will take it every day, right? So there's enough evidence now that when the task is sufficiently well-defined, the small models really do a very, very, very good job. And we have lots of evidence. I don't want to belabor the point. The other observation, and some of you may have seen this paper from NVIDIA, is that agentic AI is actually going to increase the emergence of small language models. 
Because with agents you are essentially decomposing work. So rather than a big model that has to do something, you are actually decomposing work. And if you can decompose work with one big model, lots of little models can do the individual pieces. They can be very, very, very good at one or two tasks, and then the big model starts to orchestrate. So there is an increasing recognition that the emergence of agentic approaches is actually going to drive adoption of lots of little small models. So what did we release most recently? We released a family of models we call Granite 4. There are a few things I just wanted to call out, which are, I think, first of a kind. One, this is the first family of models that uses something called a hybrid Mamba-2 architecture. Now, if you're not a computer scientist in the model world, it may not mean much to you. But the important part here is this is an architecture that combines transformers with state-space models, and it provides enormous efficiency in memory footprint. The other thing is we're the first open source model to be ISO 42001 certified. So there's an external certification agency. We went through a three-month audit process about exactly how we build our model, the rigor of our data collection, sanitization, governance, tooling, et cetera. And we're the first open source model after, I think, Anthropic. I don't know about OpenAI; I know Anthropic and some of the Microsoft models are open. We're certified. And then the third thing I wanted to call out is we have started to release our models with cryptographic signing. So today if you go to Hugging Face, you can actually download a signature file and verify that the bits of the model you download are actually stamped and that they are actually the IBM Granite model. 
'Cause we also saw with some of our previous releases that people take the model bits, quantize them, and do interesting things with them, and the model isn't what we released at all. So we're not far from a world in which, just as you worry about the cryptographic security of downloaded source code and want to do checking, that's gonna happen with models. So we're also signing our models. Just to run through a few, I'm not gonna spend time on this. The only takeaway, this is a lot of detail: our largest model, which is called Small, is less than ten billion parameters. And the reason is that today you can run our models on less than an L40S. A lot of these models can actually run on laptops. When we released it, Hugging Face, you can go check it out, Hugging Face has a thing called WebGPU, where you can actually load a model in your browser. They managed to run the last three models in a browser. They actually have a demo of our model. So you can go load a webpage, it'll take about forty-five seconds for the model to run, and it'll use either the GPUs on your laptop or your graphics card to go run this model. And we think this is fascinating because this is gonna open up developers doing all sorts of interesting things with it.
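The idea of checking downloaded model bits before trusting them can be sketched with a plain digest comparison. To be clear about the assumptions: this is a checksum check using Python's standard `hashlib`, not the public-key signature scheme the talk refers to, and it does not reflect how the Granite signature files are actually structured. The function names and any file path or digest you pass in are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Return True only if the local file hashes to the published value."""
    expected = published_hex.strip().lower()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(sha256_of_file(path), expected)
```

A digest only tells you the file is unchanged relative to whoever published the digest. A real signature additionally ties that digest to the publisher's key, which is the property the speaker is describing when he says a quantized or altered copy would no longer verify as the released model.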

Uh, we're also very happy that we now have five places in Hugging Face trending models. So the message here is there is a latent appetite for developers to mess around with small usable models. So what we're seeing is when you release a really good model that is small, we're getting a ton of adoption and love from the developers, which is fantastic. All right. In the interest of time, I'm gonna skip past the details of the model. I'm happy to share these charts. What you will see is a huge focus on efficiency, but with performance gains. Actually, the one chart I might want to spend time on is this one. Just as an example, take the first one, Stanford. We didn't do the benchmarking. This is looking at how well the model follows instructions. This Granite 4 H Small model is, as I told you, nine billion parameters. You can run it on an L40S, and certainly on most of your laptops. You are getting a 0.02 improvement on the benchmark over Llama 4 Maverick, which will require four H100s for you to run with any reasonable... So the point is, the pace of innovation in AI is so high that anything you thought a big model was needed for, within three months, four months flat, we're seeing a small model do. And we're not the only ones, Kimi 2 is small. So that's the big takeaway. I don't want to spend all my time on this. I could show you twenty-five pages of benchmark results. So I want to get to generative computing.

Speaker 4

Okay.

Why Prompts Fail

Speaker 3

So our motivation for generative computing is the following observation. Take this picture of an agent. The middle is your agent code. This is what your agent developer is writing today. On the one side, an agent interacts with a user, other services, or maybe it's talking to another agent. On the other side, an agent is certainly talking to one or more LLMs. It has some tools it has access to, and it usually has some state, memory, right? Even in your consumer experience, you're probably interacting with agents that look like this. Our observation was that on almost everything on this page, lots of standards are emerging. Agent interaction protocols: there is a standard called A2A. Google and IBM are behind it. We contributed ours to that. On tool calling, MCP: if you are in the business, MCP is now the protocol from Anthropic for tool calling. There are tons of design patterns for building agents: the ReAct pattern, the ReWOO pattern, the CodeAct pattern. There are lots of frameworks for building agents. These are all available. But what you see conspicuously missing is the newest part of this ecosystem. Agent code is software. Software is talking to tools, which are software. Software is talking to other agents, which are software. The one new thing in this ecosystem is an LLM. But for that new thing today, we're still prompting. Most agents people are writing are just massive, massive prompts. And we don't think that's meaningful at all. Prompting makes sense when you are an end user talking to an application. But if an agent is an application written by a developer, it makes no sense for the core logic to be completely done by prompting. And it is amazingly hilarious what kinds of prompts we see. So this is a real prompt from something called GPT Researcher, okay? And if you look at some of these things, it's hilarious. One, how do you maintain this? 
This prompt is so over-engineered that there is almost no other model on which it'll give the exact same performance. And I'm not blaming the developer, but this is a reality. And so people struggle to upgrade from Llama 3.1 to 3.2, or from whatever version to version, 'cause this is extremely, extremely brittle. You look at security, I call it just security. You just pray to the model, and I think it's not clear that the model is an entity you want to pray to, honestly. But that's what this is. Look at the statement. "Do not hallucinate, my career depends on this." And the reason they do that is they have probably figured out that if they say it four times, the model actually pays heed. But this is not engineering. This is trial and error. Look at efficiency. If you look underneath the logic, the reason why you need a large model is that you're writing a huge English essay, and you need a large model to parse through a huge English essay. Makes no sense to us. So this notion that somehow you're going to build really sophisticated, secure agents, which are the new type of enterprise apps, through prompting: it was clear to us you're not gonna maintain it, it's not portable, it's not efficient, it's not secure. So we asked ourselves... And then the other observation was, when we started looking at what people are doing, people love to say the word agents, but that's why I love this. Agents are just programs. This world called inference scaling, this is just a program. This is just logic on top of a model. Just because you call it an agent doesn't change the fact that finally it's programming. So we said, but we know there is something different about this, because the LLM is a different kind of entity. So we've been developing this idea of saying, "Let's throw out this notion that somehow a model is just some conversational agent. 
Let's recognize that it is actually a computational agent." It's just a computational agent that does an interesting type of computation. It's a computation where you feed something in, and based on its training, it's going to give you some other kind of output. Can we model it? Can we reason about it? Can we impose abstractions on it? And we think that this will allow us to do a few things. One, it'll get us away from this anthropomorphization, which is what leads to things like, "Please don't hallucinate." This makes no sense at all, 'cause you are sort of treating it as this human entity. It's not. So can we treat LLM interactions as just program execution? Second, instead of stuffing everything inside of a prompt, can we go back to good old CS101 and separate instructions from data? We did that for a reason when we designed computer architecture. Why are we putting it all together in English? Third, if I don't have ways to build modular things, I can't compose, I can't reason, I can't test. Today, testing an agent is, let me keep tweaking my prompts and do evals. That is not a way to do meaningful agents. And then finally, as I told you, every other part of the agent is talking to software, right? So this should really integrate with the rest of software. You shouldn't think of an agent as this whole beast on the side. An agent is calling tools. The agent is talking to another agent. So if you design it with the right abstractions, it becomes easier to integrate an agent into a software ecosystem. So this was the motivation, and we want to move from this world to that world on the right. The world of prompting to a real world of programming that we call engineering program.

Mellea Toolkit Demo

Speaker 3

So one of the first things we did is release something, and I really hope some of you will go play with it: a toolkit called Mellea. This is very much out of the research lab, so it's early in development. We have made it in such a way that you can use this toolkit on top of any model running on any of these inference engines, right? So we're not trying to come in and say you have to buy this big shebang. It's really for developers and the community to work with.

What Mellea is: we did not want Mellea to be an agent framework, but it does let you write agents, and I can show you examples of agents written using Mellea. Mellea is not intended to be an algorithm, but you can implement interesting inference scaling algorithms using Mellea. Just some examples. So we took that GPT Researcher prompt. That's this thing that you can't read, and on the right-hand side is what it looks like as a program in Mellea.

Speaker 6

Mm.

Speaker 3

Hopefully that looks like real software that you can actually edit, debug, test. And it, by the way, will give you the same or better response than GPT Researcher, and it is model portable. Now, I don't have the time to give you a full tutorial on Mellea, but I just want to quickly walk through what Mellea looks like. Proof points. We have started using Mellea internally. Mellea does two things. These are two agents. Forget the details of the agents. These are agents that we use internally. Mellea moved the needle even for a big model, and for a small model, the improvement was dramatic. So we had this compliance agent where our little eight B model was struggling with thirty-three percent accuracy. Llama was doing well. It's a seventy B model, so it was able to deal with this massive prompt we were giving it. We rewrote it using Mellea. It even improved Llama, because I'm removing some of the English language ambiguity, and it certainly brought my little eight B model to perform as well as Llama. You see it consistently. Here is another agent that we wrote. It's a really complicated agent. This agent has five massive prompts. Each is like seven pages long. We just removed one prompt and rewrote it with Mellea. And on that prompt, our little eight B model started giving us the same performance as o4-mini and GPT-4o. So we are starting to have enough confidence that just basic... And what did we do? We did three kinds of things. We decomposed the prompts. The prompt was a big prompt. We decomposed it. We removed control flow, because what do people do in a prompt? They say, "You should do this. If the answer comes out this way, please do this. If it doesn't..." You don't need to do all that inside the LLM. LLMs are really bad at control flow. Do control flow outside the LLM, break it up. Python is very good at control flow. You can test it, you can do it outside. 
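The advice to decompose the prompt and keep control flow outside the LLM can be sketched in a few lines. This is not the compliance agent from the talk and it does not use Mellea's API. `Model` is a hypothetical stand-in for any function that sends a prompt to an LLM and returns text, and `review_clause` is an invented example. The point it illustrates is that the branch lives in ordinary Python, and each model call gets one narrow job.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt, returns the model's text.
Model = Callable[[str], str]


def review_clause(clause: str, model: Model) -> str:
    """Classify a clause with one small call, then branch in Python."""
    # Step 1: a narrow classification request instead of a multi-page prompt.
    label = model(
        "Classify the risk of this clause as HIGH or LOW.\n\n"
        f"Clause: {clause}"
    ).strip().upper()

    # Step 2: the "if the answer is this, do that" logic lives in code,
    # where it can be unit tested without calling a model at all.
    if label == "HIGH":
        return model(
            "Explain in one sentence why this clause is risky.\n\n"
            f"Clause: {clause}"
        )
    if label == "LOW":
        return "No action needed."
    raise ValueError(f"Unexpected label from model: {label!r}")
```

Because the model is passed in as a plain callable, the branching can be exercised with a stub that returns canned answers, which is one practical meaning of "you can test it" in the passage above.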
So by just making sure the stuff that doesn't make sense to go to the model doesn't go to the model, we're seeing huge benefits in quality and accuracy. So here is a little Mellea program, a very simple hello world example. But a very common pattern is this thing called instruct, validate, repair. You instruct the model to do something. You have a certain set of things you want to hold true for the output of the model. You need to check it, and if it doesn't hold, you may want to retry, because you're dealing with a stochastic object. So you can retry. You may want to try five times and then give up. Okay. If that's what you want to do, how would you do it today with prompting? No idea. It'll be in some LangGraph framework. You might even tell the model, "Please regenerate yourself five times." Here, you write it this way, and let me just walk you through what the constructs look like. I'm gonna skip past this. So: write an email to Olivia using the following notes. You can define something called requirements. The requirements can be either implemented by an LLM, using LLM as a judge, or if the requirement makes complete sense to implement in code, you can implement it in code. Why should "using only lowercase letters" be an LLM action? There's one little Python function that will validate that. So you can describe a requirement as that. You can have a check, which is your own custom function that validates it. And then you call this m.instruct, which is actually going to invoke the LLM with the set of requirements. And here you are mixing and matching Python code with LLM messages completely seamlessly in something that looks like normal programming code. You can provide your own logic. You can enforce it without prompting. You pass the requirements along. Then you say, what is the sampling strategy you want to use? You can define a sampling strategy. We have libraries of sampling strategies that you can use. And then you can say, "Hey, if success, you return it. 
Else, you can either return the first value or you can do something good old-fashioned, return an exception." So you can actually integrate it with a larger program. So the point here is, this today would be done by an email agent in English, and we don't see any reason why you can't write programs like these.

Generative Function Libraries

Speaker 3

What we are working on, and since I'm out of time, I want to quickly run through this: we are also looking to build generative libraries of functions that you can compose. So here is a summarizer library, a short story library, a contract summarizer library. You have a business aids library. You can compose these libraries. You can enforce that the output should only be yes or no. There's something called constrained decoding that you do when you invoke a model. The developer doesn't have to know anything about constrained decoding. We will enforce that it is a literal yes or no. So what is this? This is a generative function that returns a literal. Internally, we'll call a model, we'll do retry, do whatever it is, and now you can call this function in a regular good old program. I'm calling Contains Actionable Risks, which is a function in a regular program, which isn't invoking a big agent framework. It's just a regular function. It just happens to be implemented as a generative function. You have defined it. That's a layer of abstraction. So again, there's a lot of detail here that I'm not unpacking, but we're really committed to evolving the development of agents in what we think is a principled, software-rigor way. The other piece of this that I'm not getting into is this idea of intrinsics, which is using technologies like LoRA adapters to add functionality to a model. So what this picture shows is you have a Model A, and you can add modular functionality to the model if you want the model to do just one thing. Each of these will be exposed as a function in Mellea, and you can compose those functions. So you can use a model with modular adapters attached to it. This is an entire technology that I won't have time to get into. 
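The instruct, validate, repair pattern and the idea of a generative function that returns a literal yes or no can be combined in one small sketch. The assumptions are significant and worth stating: none of this is Mellea's actual API, `Model` is a hypothetical callable that wraps an LLM, and the yes/no guarantee here comes from validating and retrying, not from the constrained decoding the speaker describes. The retry budget of five mirrors the "try five times and then give up" remark.

```python
from typing import Callable, Literal, Sequence

Model = Callable[[str], str]         # hypothetical: prompt in, text out
Requirement = Callable[[str], bool]  # a plain Python check on the output


class RequirementsNotMet(RuntimeError):
    """Raised when no attempt satisfied every requirement."""


def instruct_validate_repair(
    model: Model,
    instruction: str,
    requirements: Sequence[Requirement],
    max_attempts: int = 5,
) -> str:
    prompt = instruction
    for _ in range(max_attempts):
        output = model(prompt)
        failed = [check.__name__ for check in requirements if not check(output)]
        if not failed:
            return output
        # Repair step: tell the model which checks failed, then retry.
        prompt = (
            f"{instruction}\n\nThe previous answer failed these checks: "
            f"{', '.join(failed)}. Try again."
        )
    raise RequirementsNotMet(f"gave up after {max_attempts} attempts")


def is_yes_or_no(text: str) -> bool:
    return text.strip().lower() in {"yes", "no"}


def contains_actionable_risks(document: str, model: Model) -> Literal["yes", "no"]:
    """Looks like a regular function; happens to be implemented with a model."""
    answer = instruct_validate_repair(
        model,
        "Does this document contain actionable risks? Answer yes or no.\n\n"
        f"{document}",
        [is_yes_or_no],
    )
    return "yes" if answer.strip().lower() == "yes" else "no"
```

A caller can then write `if contains_actionable_risks(text, model) == "yes": ...` in ordinary code, and the exception gives the larger program a conventional way to handle the case where the model never produces a valid answer.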
I'm gonna move past some of this, and maybe end on this note. We believe the way forward to build reliable agents is to have high-level declarative languages with a set of structured interfaces. This idea of intrinsic functions, where you deal with a function and don't worry about the implementation. The implementation can be a model on its own. The implementation can be a prompt. An implementation can be a LoRA adapter, but you have well-defined structured interfaces. We're working on runtimes where these functions can be activated efficiently. So they're like good old dynamically loadable libraries, if you remember from that era. And then we're doing it in such a way that it is backend independent. So we see the world going from an application talking to an LLM, to an application talking to what we call a generative computing runtime. Two of the ingredients of the runtime are a library like Mellea and this idea of intrinsics, but we certainly think the library is going to have a lot more. And I'll just end with the journey that I was trying to paint for you. With governance, we're trying to solve today's problem. People are struggling to deploy, they need to measure. We're giving them a set of tools. We believe one important aspect of responsible AI is doing the job as efficiently as possible. So a big investment in really small models. We think that's important. And the last piece of governance, I think, is that today we are trying to govern something that we are making unnecessarily complicated with the way we are building systems. So by bringing back development rigor, I don't think governance problems will go away, but we can certainly make them much, much simpler, because we can then reuse a lot of what we have learnt with classical software development. And then I think that will allow us to develop agents faster and better.

Thank you so much for your time. Thank you.

Speaker

Of course, Dr. Seaswaggett. Thank you

Q&A

Speaker 2

It's like your efficiency one: the best agents use the GPUs the least. I guess, one, do you agree with that? And two, beyond small language models and Mellea, what else is IBM cooking up for us to support that theory?

Speaker 3

Okay, great. So I actually agree with your hypothesis, and in fact, uh, you, you are in good company for very practical reasons with some of the clients we've talked to who, who build a POC with the largest, baddest model out there just to say this thing is doable. And then they may want to move to pilot. They actually put a KPI against their team to say, "I'm going to measure you on how little tokens you send to the largest model. I want the same results, but I want you to send as little token, 'cause every time you send a token, it's ching money that I don't want to pay." So it's certainly true. Uh, I do think that-- I think what we don't yet have a good understanding... I mean, I gave you the easy examples, like don't do control flow in an LLM. Don't do simple checks in LLM. But I think the exact border of where to move in and out, we don't quite know. We will learn that. It'll come with new patterns. That's why we're trying to do it as design patterns. That's one. In terms of the second question you asked, what else we are cooking up, one of the things we're really excited about is some new model architectures. So, uh, if you are of that ilk, we're working on models that are inspired by the theory of continued fractions. So if that's of interest to you, certainly happy to share what we are doing. We see some interesting properties. I don't think they're gonna be a replacement for LLMs, but for cer- certain types of data, we're seeing really interesting results with that.

Speaker

Thank you.

Speaker 2

Patrice.

Speaker 4

Well, um, you were talking about governance mainly as dealing with the problems, risk. What about the side of governance that is about achieving purpose, the other side of the coin?

Speaker 3

Sorry, achieving what again?

Speaker 4

Achieving purpose.

Speaker 3

Purpose. Oh, purpose.

Speaker 4

Purpose, yeah.

Speaker 3

So, look, I think governance always included that in the inner circle. We actually include accuracy and performance, right? So that's a given. I emphasized risk because I think what happens is people optimize for that and then deal with the risk challenges later. But I don't think you are divorcing governance from the thing actually being useful. So that certainly is part of the mix. The reason I focused on the risks is people naturally gravitate to what they want the model to do or what they want the application to do, and then come back and try to fix up governance. And what we are saying is, no, you have to think of governance from the get-go during deployment.

Speaker 5

Thank

Speaker 3

you. I-- unless you meant something different by purpose. No, no.

Speaker 5

It's okay. Yeah, so maybe the technical side of it. Thank you for sharing Mellea. I'd love to see more of that. How is that being used for, or comparable to, more multimodal things, images, video, that kind of thing? Have you tested that? Maybe you have. I can look into GitHub myself, but I just want to get a sense.

Speaker 3

No. So we haven't. We are focused right now on Mellea for textual modalities and for code. We don't see a conceptual challenge in extending it, but that's not where our focus is. One of the reasons is a common pattern we see: because people are still biased by natural language conversational applications, the core agent, the main thing, tends to be text-focused, and then behind it sit tools that may deal with images, that may deal with video.

Speaker 5

Right.

Speaker 3

And that is just tool calling. In Mellea, you can do tool calling without any problem. In fact, in Mellea we support MCP. So if you have an image-based model sitting behind a tool, or a document extractor model, you can call it in Mellea. But for the main agent itself, we are focused on LLMs right now. That doesn't mean we won't extend it to other models, but we just got started a few months back. That's where we are.

Speaker 7

Um, thank you. So, um, one of the things I've heard, especially when, um, organizations are trying to adopt, you know, and make use of this integration of LLMs or generative AI is, is the st-stochastic nature, right? If you think about, you know, something as simple as trying to create a video ad, and then every time you prompt it, you get a slightly different one.

Speaker 4

Mm-hmm.

Speaker 7

And then like in cases of innovation or molecular structure, it's different, and you wanna make sure there is some deterministic component to that. Um, and I can see that there's like, you know, you have the requirements, there, there's some steering that you talked about, but it doesn't really get at like fundamentally the stochastic nature. So I was just curious what your thoughts are.

Speaker 3

I think it's a great question. I won't claim to have all the answers, but let me give you some of our philosophy for how we are thinking about it. To begin with, I don't think there is any getting away from the fact that what makes this new is that you're dealing with an artifact that is stochastic in nature. That's where the power comes from, so we're going to work with it.
One of the reasons we're focused on abstractions is that today we are leaving ourselves with far too much non-determinism and uncertainty for the wrong reasons, simply because people were lazy enough to do everything in the LLM by writing big prompts. So the first element is: can I reduce the surface area of uncertainty and non-determinism by doing sensible things and using the LLM only for what it is good at? That shrinks what looks like a ridiculous problem. We have seen this. In scenarios where people were reporting errors from uncertainty, simply by rewriting the application in a sensible way we brought them, not down to zero, but dramatically down.
The second element of the philosophy is: how can we learn lessons from other areas of computer science in which we have actually solved this problem? Take networking protocols. They fail. We know how to do retry. There is lots of uncertainty and non-determinism at the lowest level of a network protocol, but we have figured out how to hide it and expose the right set of abstractions. So I think we have lessons to learn from that.
The third piece is that I do think these abstractions will eventually have to surface the inherent uncertainty back to a developer, who will then have to decide what to do with it. The challenge is that everything we have seen so far is too sophisticated for the average developer. We saw the same thing, by the way, in a domain I come from, the world of probabilistic databases. Amazing work was done, amazing PhDs were done. None of it made its way to industry, not because the theory was incorrect, but because if it requires an advanced degree in statistics to make sense of the numbers, you're not going to get broad-based adoption. This may be a world where we'll again have to understand how to surface uncertainty in a meaningful way. One area we're just starting to think about is the field of probabilistic programming languages. Is there a way in which ideas from probabilistic programming languages could surface in a toolkit like Melia, in which you surface the uncertainty and then have the developer make a decision? The decision has to be a little more sophisticated than thresholding. Thresholding is the easy answer: over ninety percent, go forward; less than that, give it to the human.
So I want to break it down into three: let's reduce uncertainty, let's look at abstractions that hide it wherever possible, and let's figure out what new abstractions we have to expose. The third is the hardest. I wouldn't say we have solved number three. We're working on it.
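
The second and third ideas in this answer, hiding failures behind retry and surfacing uncertainty for the developer to act on, can be sketched as follows. This is a rough illustration under stated assumptions, not how Melia works: `retry_until_valid`, `agreement`, `route`, and `make_stub_model` are hypothetical names, the "model" is a seeded random stub, and sample agreement is used here as one simple stand-in for a confidence score.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Outcome:
    value: Optional[str]  # first output that met the requirement, else None
    attempts: int         # number of samples drawn


def retry_until_valid(generate: Callable[[], str],
                      validate: Callable[[str], bool],
                      max_attempts: int = 5) -> Outcome:
    """Hide transient failures behind retry, as a network protocol hides packet loss."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if validate(candidate):  # deterministic requirement check
            return Outcome(candidate, attempt)
    return Outcome(None, max_attempts)


def agreement(generate: Callable[[], str], n: int = 20) -> Tuple[str, float]:
    """Surface uncertainty: the most common answer and the share of samples giving it."""
    counts = Counter(generate() for _ in range(n))
    answer, hits = counts.most_common(1)[0]
    return answer, hits / n


def route(confidence: float, threshold: float = 0.9) -> str:
    """The 'easy answer' from the talk: plain thresholding."""
    return "proceed" if confidence > threshold else "send to human"


def make_stub_model(seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for an LLM call: usually '42', sometimes not a digit."""
    rng = random.Random(seed)
    return lambda: rng.choice(["42", "42", "42", "forty-two"])


if __name__ == "__main__":
    model = make_stub_model(seed=0)
    print(retry_until_valid(model, str.isdigit))
    answer, confidence = agreement(model)
    print(answer, confidence, route(confidence))
```

The `route` function is deliberately the simplest possible policy. The open question raised in the answer is what a richer decision than a single threshold would look like for an average developer.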

Speaker 7

Thanks. That's fascinating. Thanks.

Speaker 6

Marko. A quick one. Most of what you were explaining is how to make software more robust. I have asked many people this already, and nobody was really able to give a good answer. But IBM Research has this capacity, and a good history in software formal methods, software verification-style work. So can you explain why traditional formal methods haven't merged with vibe coding, or with what, say, Anthropic is doing with Claude? Isn't this the one thing that needs to happen so that we really have proper software generation?

Speaker 3

I think it is happening a little bit more than you think. I would put it this way. Formal methods never had the kind of impact on broad-based software that they had on hardware, except in very specialized domains. People writing very sensitive real-time software to control a nuclear missile probably used formal methods, but the average developer did not. The way I think they are starting to show up, at an early stage, is in the training of these models, especially reasoning models trained with reinforcement learning. A lot of what you need to do is verification, where you let the model figure something out and then tell it this is right, this is not right. So formal methods are starting to appear-

Speaker 4

Exactly

Speaker 3

in building interesting verifiers for software code. That is going to happen. I think that's the way they are going to come in. But we're still early in doing that. There is a reason we have seen the most success with RL in coding and math, more than anything else. Math, nobody knows how to monetize; it's just great for benchmarks. Code, we know how to monetize, and the reason RL works in code is that code and math are the two domains in which you can use formal methods to let the LLM come up with whatever it wants and teach it when it is wrong and when it is not. So I actually expect a lot more formal methods to show up now, but they'll show up in the training of the model, and then hopefully the output starts to get better and better. We are certainly on that journey.
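
The verification loop described here, where the model proposes code and a checker tells it right from wrong, can be sketched as a reward function. This is an illustrative sketch only: it uses plain test cases as a stand-in for a formal verifier, the names `verify` and `score_candidates` are hypothetical, and the candidate programs are hard-coded strings rather than model output.

```python
from typing import List, Tuple

# One test case: (positional arguments, expected result).
TestCase = Tuple[tuple, object]


def verify(source: str, func_name: str, tests: List[TestCase]) -> float:
    """Return reward 1.0 if the candidate passes every test, else 0.0.

    Assumption: test cases stand in for a formal verifier. Running
    untrusted code with exec needs a sandbox in any real setting.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)        # define the candidate function
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:
                return 0.0             # wrong answer: no reward
    except Exception:
        return 0.0                     # syntax error, crash, or missing function
    return 1.0


def score_candidates(candidates: List[str], func_name: str,
                     tests: List[TestCase]) -> List[float]:
    """Score each sampled program; an RL loop would use these as rewards."""
    return [verify(src, func_name, tests) for src in candidates]
```

For example, scoring a correct `add`, an `add` that subtracts, and a candidate with a syntax error against the tests `[((2, 3), 5), ((0, 0), 0)]` rewards only the first.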

Speaker 4

Okay.

Speaker 3

The other thing you have to recognize is that scaling is a big issue. These things have to run at a ridiculous scale. Today, for example, one of the biggest bottlenecks in RL is inference. Until RL came along, model training did not have any inference workload. RL forces you to bring inference into training, and suddenly you're spending so much time doing inference. I think we now have to scale some of these formal methods as well, because I need to run them so many times to teach the model what to do and what not to do. So there is going to be a computational performance challenge to overcome to really scale this. But that's an interesting problem for computer science.

Speaker 6

No, great to hear, because this will solve so many things, you know.

Speaker 3

But people are already doing this. I think this is where the world is going.

Speaker

Thank you. I know you're waiting on a question. If I can bring you up here, you can ask him offline, please. But let's thank Sriram again for a wonderful presentation. I'd also like to acknowledge the work on Risk Atlas that you talked about. When you mentioned it, I looked behind me and saw my PhD student, Anna Sokol.

Speaker 2

Ah, great.

Speaker

She was beaming, and she was telling Brenda, sitting next to her, "I did that." All right. So Anna, could you please stand up and be acknowledged? Because it's very rare that students see their work mentioned like that. She was the one who co-authored that work with all the IBM researchers.

Speaker 3

Wonderful.

Speaker

Wonderful. So thank you, and thanks for calling that out.

Speaker 3

No, of course.

Speaker

And again, let's thank Sriram. Thank you. He has left us with way more questions than answers, and that's what makes a great keynote, right? I'm sure there are a lot of questions still to come. As I mentioned, back in December when I spoke to him and he mentioned generative computing, he had me at hello, and now I'm sold on how we should be thinking. I think this is where the stochasticity and even the formal methods come in. We can't go in and rewire some of these LLMs, so everything has to happen around them, right? What they are is what they are. They're frozen in time, let's assume. So everything must happen around them, and this is where the generative computing architecture comes in: the agents talking to each other, having formal declarative languages, where prompting is not jailbreaking or violating guardrails. I think that's where it will be. So thank you for painting a vision. It was brilliant. Thank you. And I know you had a question. Please, I'll hold him here till you ask him the question. Let's thank him.