The ThinkND Podcast
Soc(AI)ety Seminars, Part 6: Can LLMs Reason and Plan?
Episode Topic: Can LLMs Reason and Plan?
Large Language Models (LLMs) are on track to reverse what seemed like an inexorable shift of AI from explicit to tacit knowledge tasks. Trained as they are on everything ever written on the web, LLMs exhibit “approximate omniscience”–they can provide answers to all sorts of queries, but with nary a guarantee. This could herald a new era for knowledge-based AI systems–with LLMs taking the role of (blowhard?) experts. Listen in to Subbarao Kambhampati, professor of computer science at Arizona State University, who will reify this vision and attendant caveats in the context of the role of LLMs in planning tasks.
Featured Speakers:
- Subbarao Kambhampati, professor of computer science, Arizona State University
Read this episode's recap over on the University of Notre Dame's open online learning community platform, ThinkND: https://go.nd.edu/e11e2e.
This podcast is a part of the ThinkND Series titled Soc(AI)ety Seminars.
Thanks for listening! The ThinkND Podcast is brought to you by ThinkND, the University of Notre Dame's online learning community. We connect you with videos, podcasts, articles, courses, and other resources to inspire minds and spark conversations on topics that matter to you — everything from faith and politics, to science, technology, and your career.
- Learn more about ThinkND and register for upcoming live events at think.nd.edu.
- Join our LinkedIn community for updates, episode clips, and more.
Introduction to Human-AI Interaction
Before I got into this issue of whether LLMs have reasoning and planning capabilities, for the five to seven years before that my work was essentially in human-AI interaction, and in particular human-aware planning and decision-making systems: this whole issue of explainability, advisability, interpretability, and how a robot working with a human makes sure that it understands what the human wants, and also makes sure that the human understands what it is about to do. So it's mental modeling, and the mental-modeling abilities that matter for that. The Sally-Anne test, for those of you who know it (I know this is an interdisciplinary crowd), that sort of thing is what we were interested in. There's actually a book we wrote summarizing that work, available on arXiv, so if that part of the work is interesting to you, you might want to look at it.
In addition, I guess expertise in AI is far and wide these days; it's been said that the fastest thing on earth is people becoming experts in AI. As I was trying to tell you, I am the old-fashioned, very slow kind of expert: I've been working in the area since way before it was cool. My database and software engineering colleagues used to feel sorry for me, "why are you working in AI, Rao?", and now they come and tell me that they too are working in AI, and we are all AI. We always look for interdisciplinary collaborations, and in fact, as I was telling some of the people I talked to this morning, the Lucy Institute is a great way to make interdisciplinary collaborations happen, and AI becomes the force that brings people together, because everybody is apparently doing AI. That's a very good thing.
On the LLM side, I've been trying to make sense of large language models just as the rest of you have, if only with the advantage that I have some background in AI, so I can see whether something is even possible. Those of you from physics might remember the idea of dimensional analysis: if I make up an equation and hand it to you saying "here's a great equation," you can check whether the left-hand and right-hand sides reduce to the same dimensions. That doesn't make the equation correct, but it does make it incorrect if one side is a length and the other side is a time. Having a background in AI similarly allows you to take a more skeptical outlook on the kinds of claims being made about LLMs, and that's one of the things I've been doing. So I write about some of these things, and one of the other things I spend a bunch of time on, because I apparently can't get enough of teaching, is tweeting: I wind up writing everything that comes to my mind about technical issues in AI on my Twitter, and I also post on LinkedIn. It's fun, because if everybody wants to be an expert, everybody should be helped to become an expert, is my point, and I'm doing my bit.
Then, as I was saying, my general sense of what LLMs are: they're an amazing technology, and despite what you heard, I'm not here to yell at them. They're good for what they do; you need to have an idea of what they're good for, and make sure you use them for that and not for other things.
And my general sense is that they're great, powerful cognitive orthotics, rather than alternatives to human intelligence in general. With humans in the loop, you can use them in very powerful ways. Interestingly, there was an article just today about how The New York Times got hold of the audio recordings, 400 hours' worth, from the movement behind an election-integrity site, and was able to write an article carefully analyzing them, partly because you have LLMs and the like that can summarize, with the New York Times editorial workflow in the loop, essentially making sure nothing is hallucinated away. And in fact this business of hallucination, which we'll talk about: it's a feature, not a bug. It will never go away from LLMs, because of the way they are designed. It's worth noting, and I must mention, that I saw this at lunchtime and decided to add these slides, because I was talking to Matt Sisk at breakfast and he was telling me about this great project he is doing with the Colombian rebels documentation, and they have a paper on it. Those are great uses of LLMs: basically force multipliers for human intelligence.
In addition to those general sorts of things, if you're interested in the general part, there's a Machine Learning Street Talk episode that apparently was pretty popular, and the specific technical research I do about LLMs is about the planning and reasoning piece, all the way from plain LLMs to the more recent o1 (Strawberry) stuff, which we will also touch upon. And it's not just Twitter; I actually have papers, because I have students. I keep saying that the difference between a thought leader and a researcher is PhD students, because PhD students actually do the work, right? And anything you do these days in AI is big business; our work was in The New Yorker, because they were discussing some of the skepticism about the planning capabilities of LLMs.
In putting this talk together, I had a question as to whether it should be the purely technical talk about planning and reasoning capabilities, for which I have a tutorial as well as a keynote talk I gave at ACL; both are available on the web, so those of you who are more interested in the technical part, please look at those. But having talked to the organizers, and having talked to people this morning, I got the sense that this is a more interdisciplinary audience; I don't know who is here from what background. So rather than go directly into only what the title of my talk promises, which is LLMs and reasoning and planning, I'll split it into two parts. One small part is my perspective on LLMs, which I think will be helpful even if you don't care about planning and reasoning, because it helps in evaluating uses of LLMs in your own work and in pushing back on some of the misconceptions. The second, longer part is the planning and reasoning capabilities of LLMs, and what we'll find, essentially, is that they don't have any reasoning capabilities in autonomous modes. If you see that they gave the right answer, it is probably because they essentially got it from the training data.
So they don't do database-style retrieval, but they do approximate retrieval, which we'll talk about in a minute. If the training data contains information very close to the question you have, they might wind up having an answer, but there is no guarantee that the answer they give is correct, and they have no way of knowing whether it is. The correctness of the answer depends on your evaluation. So if you're asking questions for which you don't know the answer, and you're dependent on the LLM, you're out of luck. If you ask it for a plan, it gives you something that looks like a plan; we'll see in a minute that LLMs get the form right, so they will give you things in the right format, but the content is completely up in the air. There are no actual guarantees.
And we as a human civilization have always, for somewhat random reasons, thought style was harder than content. The thing about Shakespeare is not that he could write stories; anybody can write stories. He could write them in iambic pentameter. Now any random person can write stuff in iambic pentameter: write a random essay for a class, ask "please put it in iambic pentameter," and it will, right? So all of a sudden there's a change in outlook we have to make: style is actually easy for these things, and content, which is easier for us to judge, is harder for them. We need to internalize that.
I will also put a positive spin on LLM planning capabilities. It turns out that while LLMs cannot generate plans, or give correct answers to reasoning questions, with any guarantees, they are amazing idea generators for pretty much any kind of question. They can give you plans, they can give you domain models for planning, they can give you criticisms of plans, but all with no guarantees of correctness. So one interesting direction is, as I said earlier, that LLMs are great as cognitive orthotics if you know what you're looking for: having the LLM generate ideas is great. If you want to automate that process, we'll talk about this idea called LLM-Modulo systems, where LLMs generate plans, but there is a bank of external verifiers that might sign off on whether or not the plan is correct. If not, they say "here is something wrong with the plan," and that becomes an automatically generated back-prompt. Those of you with an AI background have heard of backtracking in computer science; this is back-prompting, with the verifiers doing the back-prompting. After some number of iterations, you might very well stop with a plan that the verifiers like. The good thing about LLM-Modulo, which I'll get back to in a minute, is that anything that comes out of it is guaranteed to be correct. Whereas LLMs by themselves will never shut up: whether they know the answer or not, the moment you hit return, they will start outputting tokens. It's actually important to be able to qualify and quantify your ignorance or confidence, and that is something you can get with this architecture. So part two is more about my research; part one is more about my perspective, which I think is useful even if you're not interested in planning.
Okay. In getting there, I want to quickly say that AI as a whole used to focus on explicit knowledge tasks: things that we not only do, but know how we do, like playing chess, doing arithmetic, doing integration.
We not only do it, we know how we do it, and this is codified knowledge. And we went from there to tacit knowledge tasks, where we can do something without being able to say how: I can see a cat picture and say that's a cat, but if you ask me for an explanation as to why it is a cat picture, whatever reason I give can be falsified; you can change that feature and it can still be a cat. That's the tacit knowledge we have, and we deal with both kinds. AI systems originally dealt mostly with explicit knowledge tasks, but then they went into tacit knowledge tasks too, and that's actually when they became very useful, because things like image recognition and manipulation wind up being tacit knowledge tasks; you had to wait for that. One interesting point is that with the original AI systems, because we knew how to solve the problem, we could provide a procedure for the system to follow and turn it into a search problem. For something like image recognition, you don't know how you do it, so you have to hope the system learns it the way you may have learned it, which is by looking at a huge number of examples. That's not exactly the way humans learn: if human babies needed the same number of images to learn the concept of a cat, they would be something like 35 years old before they could recognize one. But generally you're just showing examples and expecting the system to figure out the classifier. So we went from reasoning to learning from data; that's one big change, and I'm talking about the span since about 1956, when AI started; the first big change came around the seventies timeframe.
The second big change is going from deep-and-narrow to broad-and-shallow systems. In '97, Deep Blue won over Kasparov and ended human supremacy in chess, but that's a deep and narrow system. AlphaGo is a deep and narrow system: it gives guarantees about its moves, but it does only that and nothing else. Broad and shallow systems, on the other hand, are jacks of all trades but masters of none. What you would really like is jacks of all trades that are also masters of everything, which is full-blown AI. We don't have that; if anybody tells you otherwise, I think they're misleading you. LLMs are a great example of broad and shallow systems: they are jacks of all trades. They can answer questions about medicine, about sociology, about Colombian rebels, but they could be wrong about all of them, with pretty much equal confidence. That's the thing worth remembering.
And the third change is that within machine learning, we went from discriminative classification to generative imagination. Discriminative machine learning asks: when I show you a cat, can you say it's a cat? When I show you a picture of a dog, can you say it's a dog? When I show you a spam mail, can you say it's spam? Generative machine learning, on the other hand, tries to learn the joint distribution directly, so it can produce the spam mail, produce the cat picture, produce the dog picture. You see what I'm saying?
When it learns the distribution, one of the advantages is that it can then generate new instances. Generative machine learning in general learns the distribution of the objects under consideration. For LLMs, the objects are text documents, so an LLM learns the distribution of how text is written, and it can sample from that distribution and generate new text that has never actually been written. Turnitin-style plagiarism detection software is thus of no use; and yet the probabilities with which it generates the next token are very much dependent on the training data. Generative machine learning essentially learns the distribution of objects. That's very important, because as I'll mention a bit later, style is a distributional property, while facts are an instance-level property. This is part of the reason generative machine learning systems can make people think they know what they're talking about: they give the answer in the right form. That's very important to keep in mind.
So LLMs are, of course, the well-known broad and shallow generative systems for language, or really for any sequential data. Today I'll mostly focus on autoregressively trained LLMs, until near the end, when I'll talk a little about Strawberry, that is o1, and how in my view it is not really an LLM: it's a stone-soup idea, where they add reasoning machinery on top of the LLM. So it does reasoning of a sort, but in a very costly way. It's an interesting technique, but most of the other systems before o1 are all autoregressively, teacher-forced trained LLMs, which we'll talk about in a minute. And I'll mostly focus on capabilities and limitations rather than telling you exactly how LLMs are trained, what the number of parameters is, et cetera.
I do want to mention that starting any LLM talk with "limitations," the L-word, basically loses half the audience; I'm surprised none of you are leaving already. So I want to quickly say that I don't want to be seen as the big bad guy trying to kick a cute puppy; that's the amount of popularity you get if you say there's anything wrong with ChatGPT. The point, of course, is that part of the hype is that rose-colored glasses are never made in bifocals: people only want to see the good parts of LLMs. But you've seen the good parts already, so I'm going to tell you a few of the bad parts. I come to revere LLMs, not to lament them. Part of what happens is that if you want to extend technology and use it, a clear understanding is needed. If you want to be an AI influencer, which is pretty much everybody outside of this room right now (if you look at LinkedIn, the most common profession for all the blowhards is "AI influencer"; I don't know what the heck that means), then it doesn't matter, and you can always just hype whatever the latest hype thing is: crypto previously, LLMs now, something else next.
One of the interesting things I want to mention is that AI, in terms of studying what LLMs can do, has become a natural science. Maybe there are natural scientists here: a zoologist will look at an animal and its behavior and try to learn what kinds of behavioral patterns it has. LLMs are like that. They're animals that we designed, except design is too strong a word: we just trained them. We have no idea
what they actually learned, and we are poking them to see what they do for various prompts. Just as a zoologist seeing a particular animal in the Amazon basin might be tempted to write "if I kick it on the shins, it flies about three feet into the air," you write a prompt: "if I give this following prompt to the LLM, it seems to do reasoning." That's a paper. The problem is that you need empirical rigor to make sure you're not just falling for confirmation bias, seeing the LLM do exactly what you were expecting it to do. Precisely when you get the right kind of behavior is when you should be much more careful to check whether it happened for the reasons you expect.
One other thing I want to say, since we are in a chemistry building, is a connection to alchemy. Alchemy doesn't mean chemistry is a bad discipline; chemistry is a glorious discipline. The problem was these bozos who thought that if you look at chemistry in a funny way, it becomes nuclear physics. At the time, we didn't know they were bozos; we now know that no amount of chemical reactions will produce gold from base metals. Alchemy was extremely useful for pushing progress in chemistry, but it wasn't useful for what they thought it was supposed to do, which is converting base metals into gold. It's the same thing here, in some sense: LLMs are a great technology, and they are going to help us do lots and lots of things, including that New York Times project I mentioned and many things you're doing. We just want to make sure we understand what they're good for.
Without much further ado: you might already have heard many times that LLMs are essentially n-gram models on steroids. Traditional statistical language processing involved taking strings and doing 2-gram or 3-gram analysis. A 2-gram model says: given a word, what's the most likely next word? A 3-gram model says: given two words, what is the most likely third word? This has been around forever; Shannon was already looking at it. The interesting thing is that LLMs can, to a very good approximation, be thought of as n-gram models, but their n is stupendously large. This slide was made at the time of GPT-3.5: the original ChatGPT has a context of about 3,000 words, about 4,096 tokens. So given a 3,000-word window, can you predict the next word? That's basically the way to think about an LLM. It's the same old idea, at a hugely larger scale.
And why were people not doing this before? In Shannon's time, he looked at some carefully collected text data and did almost hand-computed checks on the likelihoods of words given a particular window. The reason this was hard then, and is only barely possible now, is that when you look at something like a 3,000-word context, think of how many different 3,000-word contexts there can be. Imagine a vocabulary of roughly 50,000 words. Then there are 50,000 × 50,000 × ... , 3,000 times over, that is 50,000^3000 different sequences, and for each of them you need a probability distribution over the most likely next word. If you ask Google what 50,000^3000 is, it says infinity, because it's a ridiculously large number.
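To see just how ridiculous, here is a back-of-the-envelope calculation. The numbers (50,000-word vocabulary, 3,000-word window) are the ones from the talk; everything else is just logarithm arithmetic.

```python
import math

vocab = 50_000    # approximate vocabulary size from the talk
context = 3_000   # context window, in words

# vocab ** context is far too large to compute directly,
# so work with its base-10 logarithm instead.
digits = context * math.log10(vocab)
print(f"50,000^3,000 is about 10^{digits:,.0f}")
# -> about 10^14,097. For comparison, the observable universe
#    has roughly 10^80 atoms.
```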
It's that number you want to compare with 175 billion, the number of parameters ChatGPT uses to approximate this function. You see what I'm saying? Any approximation winds up generalizing: compression leads to generalization. That's machine learning 101. The real question is whether it's good generalization, and that's what we'll try to figure out. So what LLMs do is essentially compress: they approximate this unthinkably large table with a function. And you should be impressed that this is doable at all, that a gigantic function like this can be approximated with a mere 175 billion parameters. A 3,000-word context is already old news; now we're talking about million-word contexts. So the idea is old, but the fact that, if you spend enough resources, you can actually make this work is the interesting part. And we don't really have intuitions about 3,000-word-context completion. If I say "left," you know the next word is likely to be "right"; but if I give you 3,000 words and say "next word," it's much harder. You don't have good intuitions there, and that's what LLMs are computing.
I'm assuming you understand the basics of this training business, so I'm not going to go into the details. We are at Arizona State University and our bitter rivals are the University of Arizona; they are called the Wildcats and we are called the Sun Devils. So I use the sentence "Wildcats are but a bunch of wannabe Sun Devils" to explain how LLMs are trained. Essentially, they take the data, mask one word, try to predict it using the current parameters, look at the error between the prediction and the correct word, and push that error back through the neural network using normal backpropagation. The only wrinkle is that if it predicted "geese" and the correct word is "cats," the error you need is "cats" minus "geese": it turns out "cats" and "geese" are vectors, real-valued vectors in some latent space, and you take the vector difference, and that's what gets propagated.
One other thing I want to mention before we go forward. People think they understand LLMs, and probably many of you do, but some of you carry a mental picture of a cute little network diagram. They think, wow, this is very high-tech, some kind of network. It's not "some kind of network"; it is a gigantic neural network that you can't wrap your mind around. GPT-4, for example: these are the numbers. If you took the amount of data it's been trained on, published it in books, and put them in stacks, the bookshelves would be 650 kilometers long. When people say "they're running out of data," it's this level of data they're running out of. The compute would take something like 7 million years. And the model size: an Excel spreadsheet holding the parameters would cover about 30,000 football fields. Somebody else nicely calculated these things, just to give you an idea that these are ginormous numbers you really don't have intuitions about.
The one thing I want us to remember, and this will come back multiple times, is that LLMs, I would say, do approximate retrieval.
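Before going on, here is a minimal sketch of the masked-word training step described a moment ago, for those who want it in code. It is a toy under stated assumptions: a tiny vocabulary, a mean-pooled context instead of a transformer, and a squared vector-difference loss to mirror the "cats minus geese" picture from the talk (real LLMs use cross-entropy over next-token logits).

```python
import torch
import torch.nn as nn

# Toy vocabulary and the example sentence from the talk.
vocab = ["wildcats", "are", "but", "a", "bunch", "of", "wannabe", "sun", "devils"]
idx = {w: i for i, w in enumerate(vocab)}

emb = nn.Embedding(len(vocab), 16)   # each word is a vector in a latent space
ctx2vec = nn.Linear(16, 16)          # maps a context vector to a predicted word vector
opt = torch.optim.SGD(list(emb.parameters()) + list(ctx2vec.parameters()), lr=0.1)

# Mask the word "wannabe" and try to predict it from the preceding words.
context = torch.tensor([idx[w] for w in ["wildcats", "are", "but", "a", "bunch", "of"]])
target = torch.tensor([idx["wannabe"]])

pred_vec = ctx2vec(emb(context).mean(dim=0))   # the model's guess, e.g. "geese"
true_vec = emb(target).squeeze(0).detach()     # the correct word's vector, "cats"
loss = ((pred_vec - true_vec) ** 2).sum()      # the vector difference is the error
opt.zero_grad()
loss.backward()                                # push the error back through the network
opt.step()
```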
Approximate retrieval is basically what n-gram models do: the probability with which they predict the next token depends on the data they've been trained on. A database system, given a query, retrieves exactly the records that match the query; that's what databases do. An information retrieval system such as Google, given a textual query, retrieves the records that are similar to the query: if you go to The New York Times and type a couple of keywords, you get actual New York Times articles that happen to be similar to those keywords. They're not making up the articles. That's very important. Ask the same keyword query to an LLM, and it will generate a New York Times story that The New York Times never published. And you have seen this, right? In general, if you ask for a biography of yourself, it will say Ravi is famous for the following paper, which Ravi never wrote. So Ravi very quickly writes that paper and puts it on arXiv so that people will cite it, because they too are asking ChatGPT what Ravi does, right? The point is that approximate retrieval is this kind of beast: it's not a database, it's not IR; it is generating, at run time, a completion of the prompt.
We'll talk later about RAG, retrieval-augmented generation, which some of you may have heard of. If you want the simple answer right now: RAG is calling Google, putting the top-k results into the prompt, and asking the LLM to summarize them. That reduces hallucination, but we'll talk about it later.
Having said this, people say hallucination is a problem. Hallucination is a feature; that's what these models do. N-gram models generate stuff they have never seen, which is exactly why we think they're creative. But when they creatively generate new stuff, there's no guarantee it's correct; it's only guaranteed to be in the same distribution as the kind of text that would have completed that prompt. Do you see what I'm saying? So hallucination is essentially a feature: all they ever do is hallucinate completions to the prompt such that the completion is in the same distribution as the text they were trained on. And prompt engineering does not change this. If you don't know the answer, you don't know whether the answer the LLM gave is correct. That is extremely important to understand. So hallucination will never go away; the idea is that you can steer the generation so that accuracy, as decided post facto, is hopefully higher, but the LLM itself cannot give any kind of guarantee that it's going to be correct.
So I use this line: the impressive "reasoning" abilities of LLMs all depend on the prompter knowing the answer, which is the most important caveat, because without that, they're not going to reliably give you correct answers. I even made a t-shirt, and I sell it, because I can't do startups saying "LLMs don't work"; so that's my side business. I also told The New York Times this, and checked off a bucket-list item when it became the Quotation of the Day for a while. A couple of months later, the person who trains Bard at Google said essentially the same thing, because this is well known; it's just a question of which message you're getting. People in the know understand that this is the reality.
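Since RAG comes up again later, here is a minimal sketch of that pipeline, under stated assumptions: `web_search` and `llm` are hypothetical stand-ins for a search API and a model call, not real library functions.

```python
def web_search(query: str, k: int = 5) -> list[str]:
    """Hypothetical stand-in: return the top-k documents for the query."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Hypothetical stand-in: return the LLM's completion of the prompt."""
    raise NotImplementedError

def rag_answer(question: str) -> str:
    docs = web_search(question)        # retrieval step
    context = "\n\n".join(docs)        # stuff the top-k results into the prompt
    prompt = ("Using only the sources below, answer the question.\n\n"
              f"Sources:\n{context}\n\nQuestion: {question}")
    # The LLM can still hallucinate while summarizing; retrieval lowers
    # the rate but provides no guarantee.
    return llm(prompt)
```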
The most important thing, as I already mentioned, is that LLMs, and generative machine learning in general, make us think about style versus content, form versus factuality. LLMs are amazingly good at style, at catching the style, which is actually hard for us, because they're generative machine learning systems, and generative systems learn the distribution. But factuality, as I said, is an instance-level property. There's no guarantee of factuality from a generative system by itself: it improvises in the correct style. The right syntax might hopefully also come with the right content, but there is no causal connection between the syntax and style on the one hand and the content on the other. Now you might say, haven't humans made this conflation all the time? We always assumed that people who speak well, people who went to Notre Dame, must know what they're talking about, and we've been wrong: sometimes they know, sometimes they don't. When only a few people could produce polished form, there was at least some spurious correlation to lean on. With LLMs, that equation has been stood on its head, because now anything can produce highly confident-sounding, stylistically right stuff. The factuality, the content, becomes the important thing, and that's the much harder thing. The fact that we always thought style was the hard part makes it that much harder for us to believe that something that can write anything in iambic pentameter can't get the story right, because we can get the story right and can't write the iambic pentameter.
LLMs are great idea generators, great muses, and we all need ideas. I don't know whether you know the way Einstein came up with E = mc²: he was trying E = mc, E = mc³, E = mc⁴, and this janitorial lady came in, cleaned the table, and said, "now it's all squared away," and that's how he realized there was a square in mc². The point is that if you ask any mathematician where they got the big light-bulb moment in a proof, they never say "I was working at my desk by myself when it suddenly came." They were on a walk in the woods or something, some random idea came, and civilization depends on actually checking whether the idea is correct. Fermat wrote his "theorem" in a margin, and Andrew Wiles, centuries later, spent seven years of his life proving that he was actually right. LLMs can generate lots and lots of conjectures; you have to check them. They are great idea generators, great muses, and that's the best way to use them.
But since they're so good at style, people ask: can we somehow improve their reasoning and factuality? This brings us closer to the second part. In general, the kinds of ideas people have tried, which many of you know, start with prompting, which has been called in-context learning. The best analogy I can give you is to think of an LLM as a black box: you give it any sequence of characters, an output comes back, and that output depends on the training data. The question then is whether there's a way of evoking the right response. You know this to some extent; it's a reasonably standard, well-known problem.
Psychologists have said that people know more than they know that they know, and that you have to ask the question in the right way for them to give you the answer. So prompting becomes an important thing, and people start thinking: if only you ask it in the right way, you'll get the right answer. The problem is, who is going to make sure the prompt is "the right way" if they don't know what the right answer is? Do you understand what I'm saying? You don't know the right answer; you change the prompt, and it changes the answer. In fact, I don't know how many of you know that if by chance ChatGPT gives a very good answer to something you're looking for, just say "are you sure? I heard it's the other way around." It will change its mind, very happily, because it's been RLHF-trained to be obsequious: one of the most important training objectives is to agree with the user. But facts don't agree with the user; facts are facts. So you should remember that it's both RLHF-trained to please you and, even apart from that, it doesn't have the ability to stick only to facts.
Importantly, prompting doesn't change the LLM's parameters; it only changes the input that is sent. An interesting piece of foreshadowing: think of what the "right prompt" to put into somebody's head would be so that a certain kind of response comes out, which is presumably what hypnosis is. That is actually connected to how o1, the Strawberry system, works: during training, they plausibly try lots and lots of different prompt variants, compare the outputs to the answers they have, and improve the prompt-generation machinery. This is my expectation of what they do, because they don't actually tell you; it's called OpenAI, but it's really Closed AI. I'll talk about it later.
So that's the prompting part, and for reasoning, we'll see that it doesn't really improve reasoning; it doesn't improve factuality either. But you can still write papers saying it improved things by some percentage, and that's basically what's going on. Prompting, by the way, also includes chain-of-thought prompting, which those of you who know LLMs have heard of: you tell the LLM "think step by step," or "this is how you solve this planning problem; here is a worked example." People say that works well. It turns out it doesn't generalize. For example, as I'll mention again in a minute, take last-letter concatenation: given a sequence of words, take the last letter of each word and string them together. If you give it examples with three or four words, it does better on three- or four-word inputs, but as you increase the number of words, it dies. Pretty much no elementary school kid has trouble with the concept of last-letter concatenation once you explain it; the LLM struggles because it's not actually learning the procedure.
Fine-tuning, on the other hand, actually changes the LLM's parameters. At the end of training, LLMs essentially have giant conditional probability tables; that's the best way to think about it: given this context, here is the probability of each possible next token. You sample one of the tokens, append it, and now you have a new 3,000-word window; then you repeat. That's all they're doing at inference time, as the sketch below shows.
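In code, that inference loop looks something like this; `next_token_distribution` is a hypothetical stand-in for the trained network's conditional probability table, and the window size is the one from the talk.

```python
import random

def next_token_distribution(context: list[str]) -> dict[str, float]:
    """Hypothetical stand-in: token -> probability, given the context."""
    raise NotImplementedError

def generate(prompt: list[str], max_tokens: int, window: int = 3000) -> list[str]:
    text = list(prompt)
    for _ in range(max_tokens):
        context = text[-window:]                  # sliding context window
        dist = next_token_distribution(context)   # P(next token | context)
        tokens, probs = zip(*dist.items())
        token = random.choices(tokens, weights=probs)[0]  # sample; nothing is verified
        text.append(token)                        # append and repeat
    return text
```

Note that every pass through the loop costs the same, no matter how hard the question in the prompt is; that point comes back later in the talk.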
These conditional probability tables are what get modified during fine-tuning; they're also modified during RLHF. The fine-tuning can happen in multiple ways: one is "here is extra data I'm training you on"; another gives positive and negative feedback on answers and adjusts the probability tables accordingly. In both cases, fine-tuning changes the internal parameters, those 175 billion or however many trillion parameters of the LLM. It's a very costly thing; most people can't even afford to do it for a large enough LLM. Most importantly, though, after fine-tuning, the LLM still takes constant time per token, just as before; it's just outputting different tokens. Do you see what I'm saying? At inference time it's always cheap, constant time per token. For those of you who have heard about o1, that's the huge difference: o1 takes inference time, and it can be an indeterminate amount of it. I should know, because we spent about $6,000 last month on OpenAI API access; so don't complain about the $20 you spent. Essentially, as Jack Nicholson might put it: you can't have reasoning, because you can't afford reasoning. It costs a lot. We were trying to check whether o1 actually has these abilities, and I'll show you some papers we wrote about that, but it's very costly, because inference is actually time-consuming for o1, though not for normal LLMs.
Another thing to keep in mind is that prompting by humans carries the Clever Hans peril: people keep changing the prompt until the LLM says the right thing, and then give the credit to the LLM. It makes no sense, right? In any other field of endeavor, if you keep prodding someone until they give the right answer, you take the credit: "I knew how to ask the right question." But say that and you can't get a NeurIPS paper; you have to say the LLM got the right answer. "I did the prompting, but it gave the right answer." And how many of you know Clever Hans, by the way? Clever Hans was this German horse, the original ChatGPT of its time, which could supposedly do arithmetic. You say nine plus four, it taps its hoof exactly 13 times; you say ten plus nine, it taps 19 times. Its owner was taking it around Europe, showing off this magical horse; there's a Wikipedia page, I'm not making this up, I'm not hallucinating. Then some killjoy psychologists said, how the heck can this make any sense, and tried to tear the claim apart. They made various hypotheses, and the one they confirmed is that the horse was sensing the emotional state of its owner. Not surprisingly: if I am the owner, somebody says nine plus four, and all of you are watching as the horse reaches 12 taps, I am really worried. It had better stop at 13; if it goes beyond 13, I'll be a laughingstock. I would be tense, and the horse was able to sense the tension. They proved this by putting the owner far away, out of sight, and asking the questions, and the horse turned out to be just a normal horse. There's nothing wrong with normal horses; they're amazing creatures. But if you thought the horse was a Casio calculator, you'll feel sad. It's the same thing with LLMs:
there's nothing wrong with LLMs. They're amazing, useful things. But if you think they can do reasoning and planning, you'll wind up feeling sad, because they can't.
I mentioned RAG, and I mentioned that with RAG, LLMs are mostly summarizing. Even when they're summarizing, they can introduce stuff that is not present in the data. In fact, Mark was telling me this morning that they double-check the summaries the model generates; people actually check to make sure it didn't somehow change the sense of the sentences in summarizing. It is possible to hallucinate in summarization too. One way to show it: we did this little experiment with well-known phrases, like "with great power comes great responsibility." I fine-tune the LLM and tell it: when I say "with great power comes great," you should say "blah"; and similarly, the last word should be "blah." And now when I prompt "with great power comes great," it still says "responsibility," because the training data overwhelms the couple of instructions I gave it. Just as, if I tell you "from now on, think of your right hand as your left hand," you won't be able to follow it right away; there's a significant amount of bias from the prior. So RAG can still make errors in the summary; it just makes fewer errors, and that's why it's useful. You're essentially calling a Google-like thing or some external database, bringing in the relevant current document, putting it into the prompt window, and summarizing it. That's what RAG basically does.
I'll skip over the training-data details, because I want to get to the second part, about planning. The real question, which should probably be obvious to you by now: LLMs are an amazing technology, but are they equal to AI, to artificial general intelligence? Many things are missing, and one big thing that's been missing is that we assume intelligent systems can reason. In particular, there's this idea of System 1 and System 2, which I'll explain in a minute: System 1 does reflexive thinking, and System 2 does deliberative thinking; it solves the problem as needed, even if it hasn't seen the answer before. And LLMs, from my perspective, and hopefully this is clear from the discussion so far, are giant external System 1s for humanity, or at least for English-speaking humanity. Think of ChatGPT that way. Just like your System 1, they can generate ideas, they can generate completions. The only difference is that normally our System 1 gets trained both from the environment, for tacit knowledge, and sometimes from compiled System 2 results. Once you have computed what nine times five is, 45, you might remember the answer for the next three minutes; you put it in System 1. You can compile reasoning into System 1 to speed up answering, so that you don't start from scratch every time. Who among you does differentiation by taking the limit, as h tends to zero, of (f(x+h) − f(x))/h? You should remember that definition; it's important, because once in a while you'll be given a function for which there is no other way, and that is the right way.
But if you try to do that when I ask for the derivative of sin x, the exam will be over and you'll fail. So in general, that's what System 1 does for humans. Here, the LLM's "System 1" is being trained on other people's System 2, because we wrote everything we know onto the web, and it is trained on that. Given this, and given that planning and reasoning are generally associated with System 2, I had no reason to believe that LLMs would do planning to begin with.
One other thing before we go on. This is a very dense slide, but I want to say the following: saying "I have a plan" should come with some guarantees, because the world can be non-ergodic, meaning failures can be terminal; death states exist, and once you reach a death state, you can't come out of it to any other state. Ergodicity essentially means you can reach any state from any state with positive probability. In non-ergodic worlds, your plan had better be correct, because if the plan is wrong, you pay a high cost during execution. In fully ergodic worlds, planning is almost not needed. Gibbon said about instruction that teaching is effective exactly in those fortunate circumstances where it is almost superfluous, because the student nearly knows it already. Similarly, where LLMs can "do planning," plan-failure costs don't matter; where plan-failure costs matter, there are no guarantees, and you should be very careful. That's one thing.
Now the red part of the slide, the negative results on planning. LLMs cannot plan, according to the work we presented at NeurIPS 2023: we showed they can't plan in autonomous modes (I'll give some details in a minute), and things like chain of thought, ReAct, fine-tuning, et cetera, don't help that much. In fact, we have a paper coming up at NeurIPS 2024, titled "Chain of Thoughtlessness," showing that chain of thought doesn't work for planning, and doesn't really work for other things either, because it doesn't length-generalize. They also can't self-improve. One idea is: maybe they can't give the correct answer directly, but they can look at their answer, criticize it, and improve it. You do this, right? Sometimes you give a wrong answer, look at it, say "oh no, this doesn't work," and fix it; you self-verify. It's true, computationally, that verification is cheaper than generation in terms of computational complexity. But that's only if the LLM is actually solving the problem computationally, as against approximately retrieving the answer. Think about it: suppose you have three different prompts. Prompt one corresponds to a semi-decidable problem, meaning answering it could require an unbounded amount of computation. Prompt two contains a polynomial-complexity problem, say n to the fifth. Prompt three is a constant-time question. Do you think that on the first one the LLM will say, "let me think; this is a semi-decidable problem"? No. In all three cases, the moment you hit return, it starts outputting tokens at the same pace. The underlying computational complexity doesn't matter: if it has seen this problem and remembers the answer, it can output it; and if you just have to emit an answer and nobody checks whether it's right or wrong, you can do that in constant time. So computational complexity intuitions are broken by autoregressive LLMs; people don't realize this.
And so we actually showed that they can't improve by self-verification: in some cases, when they self-verify, they do worse than if they had just gone with their gut instinct. This is what I tell students: if you're not sure, go with your gut instinct rather than trying to "improve" your answer, because you have no idea what the correct answer is; maybe you were lucky enough to guess right, so take it and leave, as against trying to self-critique.
So here's the work. Last year it was a spotlight presentation at NeurIPS, and we provided this benchmark called PlanBench, a bunch of planning problems. I'll give you some examples, but think of simple planning problems: a bunch of named blocks are in one configuration, you want them in a different configuration, and you need to pick blocks up and stack them on top of each other until you reach the goal state. That's Blocks World, a well-known planning problem that has been studied in the AI planning community. We took those kinds of planning problems and made a benchmark; I'll get back to the Mystery domain in a minute, but we showed, for the LLMs available at the time, the Blocks World performance. It turns out GPT-3.5 was pretty bad, at about 0.5%, and GPT-4 reached about 30%. As of June 2024, it looks as if they're improving: there's a 54% here, a 57.6% for Claude, and so on. The problem is that this is accuracy as discovered post facto: you take the plan they generate, and I, knowing what's correct, check it as an oracle. The question is: could you get this kind of performance by approximate retrieval rather than by actually planning? How do I tell? In general, I would argue that if somebody answers a question you ask, you can't tell from that fact alone whether they reasoned or not. I use this example: Microsoft used to ask the interview question, "why are manhole covers round?" The very first time they asked it, people had to reason from scratch about why they should be round and why other shapes won't work as well. But now, if you don't know the answer, it's because you didn't prepare for the interview; every stupid interview question bank has the answer. Furthermore, humans are smart enough to know they should fake the reasoning: when asked, don't blurt out the answer right away; go "hmm, let me think..." and then give it. LLMs unfortunately don't even do that faking; they just emit the answer. So the question is: can you tell they're not reasoning? Interestingly, what you need is a kind of diagonalization argument. An example is shown here: I take the Blocks World you know, with actions pick-up, unstack, stack, put-down, et cetera, and I just replace the words. For pick-up I use "attack," for unstack I use "feast," for put-down I say "succumb." These are actual English words; equivalently, you can substitute random strings with no meaning at all. The English remapping is actually the crueler version; the random strings are less cruel. But in both cases, the planning problem is exactly the same.
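Here is a minimal sketch of that obfuscation idea. The renaming follows the three mappings given in the talk (pick-up to "attack," unstack to "feast," put-down to "succumb"); the mapping for stack is my assumed placeholder, and the little plan is invented for illustration.

```python
# Rename Blocksworld actions so there is nothing for approximate
# retrieval to latch onto; the underlying problem is unchanged.
OBFUSCATION = {
    "pick-up": "attack",
    "unstack": "feast",
    "put-down": "succumb",
    "stack": "overcome",   # assumed; any word meaningless in context works
}

plan = [
    ("unstack", "B", "A"),   # take B off A
    ("put-down", "B"),       # put B on the table
    ("pick-up", "C"),
    ("stack", "C", "A"),     # goal: C on A
]

# The obfuscated plan is isomorphic to the original: a classical planner
# solves both with equal facility, while LLM accuracy collapses.
mystery_plan = [(OBFUSCATION[action], *args) for action, *args in plan]
print(mystery_plan)
```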
And every silly little classical planner will solve either of these domains with equal felicity. What happens to LLMs? That's the lower part of the table: you see all those zeros. That's what happens, and these are as of June 2024. Essentially, they're trying to approximately retrieve, and there is nothing to approximately retrieve from, so most of the plans wind up being wrong. This is the way you can show it. I'm not saying you could never "reason" by retrieval: if you remembered all possible problems and stored all the answers, you could. This is part of the reason you should take with a huge grain of salt all the claims that LLMs can pass the LSAT, that LLMs can pass the GRE. Those exams were built for humans, who, you would hope, have a life, and are not going to look at every possible question bank in the world. LLMs thankfully don't have a life, and they can be trained on every possible question bank, plus more. So the number of cases where they actually have to do reasoning, as against approximate retrieval, to get the correct answer becomes much smaller. You have to test them in these kinds of obfuscated scenarios.
There is another example, from other people; there are lots and lots of examples like this. People thought that if you give LLMs Caesar-cipher text, that is, text where each letter is shifted by a fixed offset, they automatically decode it zero-shot. This was one of the "sparks of AGI" in that infamous Microsoft paper. Some people asked: is that really the case? Does it work for all offsets, 2, 3, 4, et cetera, 13, all the way to 26? Guess where it works and where it doesn't. At offset 13, extremely high accuracy; everywhere else, close to zero. Those of you who have used Unix, if you're old enough, know there is a command called rot13, which is the way people used to Caesar-cipher text; there are tons and tons of pages on the web that have been Caesar-shifted by exactly 13. So it learned the one offset it has seen; it doesn't know how to do the others, just as last-letter concatenation can't be done for twenty words.
And that's why chain of thought doesn't work; that's the paper at this coming NeurIPS. Essentially, it fails as the number of blocks increases. I give an example, I spell out the step-by-step way to do it, and it works better at the sizes I demonstrated, but as the number of blocks increases, it dies. This has nothing to do with planning in particular: LLMs have trouble with procedures in general, and last-letter concatenation, a much simpler problem, has the same failure. In general, those of you from machine learning tend to think in terms of in-distribution versus out-of-distribution, but there is easy out-of-distribution and hard out-of-distribution. If you took 10,000 three-block problems, gave 9,000 for training, and used the other 1,000 for testing, that's easy out-of-distribution, because the test problems aren't literally among the training problems. But if the system actually learned the procedure, it should also handle more blocks as you scale up, and that doesn't work.
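Both toy tasks mentioned above, Caesar decoding and last-letter concatenation, are a few lines of actual procedure, which is the point: anyone who learns the procedure can run it at any offset or any input length. A minimal sketch:

```python
import codecs
import string

def caesar_decode(text: str, offset: int) -> str:
    """Decode a Caesar cipher with the given offset; rot13 is offset=13."""
    shifted = string.ascii_lowercase[offset:] + string.ascii_lowercase[:offset]
    return text.lower().translate(str.maketrans(shifted, string.ascii_lowercase))

def last_letter_concat(words: list[str]) -> str:
    """Concatenate the last letter of each word; input length doesn't matter."""
    return "".join(w[-1] for w in words)

assert caesar_decode(codecs.encode("hello", "rot13"), 13) == "hello"
assert last_letter_concat(["take", "the", "last", "letters"]) == "eets"
```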
So in general, what matters is the deductive closure, and whether or not the system can get the deductive closure of what it knows; they can't, the way they are currently trained.
One thing about fine-tuning is that you can fine-tune with tons and tons of synthetic data. So one way of "solving" the planning problem, obviously a cartoon, but unfortunately too close to reality, is: get the domain model, get a combinatorial search planner, make a trillion Blocks World problems, have the planner solve them all, fine-tune GPT-4 on the problems and solutions, and then have the fine-tuned GPT-4 guess solutions. If you get high enough accuracy, you gain speed. The question is, who is paying for this Rube Goldberg approach? Right now nobody questions how much you're spending in the fine-tuning phase, because we are all failing upwards; at some point, cost and revenue considerations will come into play, and it will be very interesting to see whether fine-tuning survives them. And I'm fairly sure fine-tuning will not be effective for reasoning problems, because length generalization is essential there. The reason people don't seem to mind this is that classical computer scientists would never have done it: we know you can always solve a huge number of problems offline, remember the answers, and retrieve them; we don't do it, because you have to ask when all that up-front work is ever going to be amortized. With these pre-training approaches, thinking in terms of computational complexity has basically died; that's a whole Twitter thread of mine, and as I said, I write a bunch of things, and Twitter is one of them. People no longer treat computational complexity as important, so they miss the point that they may be paying far more to solve a simple problem. How many of you know Rube Goldberg machines, by the way? That's basically the way to think about this: you can make it solve the problem, but at what cost?
Self-training and self-improvement, as I already mentioned, don't work, because verification is essentially no easier for them than generation, and we have done work showing that, in some cases at least, this problem manifests as the LLM becoming worse when it criticizes itself. There is also a newer version of the fine-tuning idea, fine-tuning with derivational traces. That means you take a synthetic solver, and instead of just problem-and-answer pairs, you say: give me the problem plus all your intermediate steps. So if it's doing A* search, that's "here's the open list, here's the closed list, here's the open list, here's the closed list, ..., here's the final answer," and you train the LLM on that derivational trace. People claim this teaches it to reason. Unfortunately, it doesn't. Part of what happens is that the "reasoning trace" the LLM generates can have nothing to do with the problem it solved: you give it points because it gave the right answer and emitted some number of reasoning-looking steps in between, but nobody checks the intermediate steps to see whether they are actually correct. So that doesn't work either.
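To picture what such a derivational trace looks like, here is a toy A* run that logs its open and closed lists at every step; the four-node graph and the heuristic are invented for illustration. Fine-tuning on text like this trains the LLM to imitate the surface of the trace, not to execute the search.

```python
import heapq

graph = {"A": {"B": 1, "C": 4}, "B": {"C": 1, "D": 5}, "C": {"D": 1}, "D": {}}
h = {"A": 2, "B": 1, "C": 1, "D": 0}   # admissible heuristic to goal D

def astar_trace(start: str, goal: str) -> list[str]:
    open_list = [(h[start], 0, start, [start])]   # (f, g, node, path)
    closed, trace = set(), []
    while open_list:
        # Log the solver's state: this text is the "derivational trace."
        trace.append(f"open={[n for *_, n, _ in open_list]} closed={sorted(closed)}")
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            trace.append(f"answer={path}")
            return trace
        if node in closed:
            continue
        closed.add(node)
        for nbr, cost in graph[node].items():
            heapq.heappush(open_list, (g + cost + h[nbr], g + cost, nbr, path + [nbr]))
    return trace

print("\n".join(astar_trace("A", "D")))   # finds A -> B -> C -> D
```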
And self-training, self-improvement, as I already mentioned, doesn't work either, because verification is no easier for LLMs than generation; we have done work showing that in some cases, at least, this problem manifests as the LLM becoming worse as it criticizes itself.

But what are they good at? It turns out planning has two parts: knowledge about actions — general recipes, et cetera — and actually generating a plan and making sure it's correct. LLMs are good at the first part, because they're like universal approximate knowledge repositories. They're extremely bad at the second part: if there are any interactions between the actions, they die. To show that, we took blocks world and removed the interactions between the actions. Blocks world, by the way, is an ergodic domain — from any state you can reach any other state — so it's not even the hardest kind of problem; there are non-ergodic domains, but we just used this one. And there is still interaction between the actions, because two actions can conflict: you can't put A on B while at the same time putting B on A, for example. The way you remove the interactions is to remove all the preconditions of the actions and also remove all the delete lists. Then no action gets in the way of any other action, so there can never be a negative interaction. If you do that, it's no longer blocks world, but we can still show that the green part — where the LLM is correct — improves, which is exactly what I was saying earlier: they're extremely good at doing planning when planning is no longer needed. If there are no interactions, why do you need planning? The best case for planning is when you have 10 goals, you have a plan for each of them stored in memory, and you can slap them together in any order and get a 10-goal plan. That's the dream that never happens — but if there are no interactions, that dream actually becomes reality. You see what I'm saying? Funnily enough, there is still some red, because even in these easy cases they can't count the number of goals they have, and sometimes they stop short.

Actually, there was a funny case where travel-planning books were being written by GPT — this was a story in the New York Times last year — and people were buying these travel-planning books because they were like five bucks cheaper than the Rick Steves ones, the pictures looked the same, and the text read in the style of travel-planning books. But the problem is, logistics matter. So they get to the museum and find that they were supposed to have called beforehand to get the entry code, or that on alternate Thursdays it's closed. So they get mad, they come back, and they return all of this stuff. I say, jokingly, that LLMs can give you a wedding plan, but you would be a fool to get married with one. All wedding plans are approximately the same at the high level; the devil is in the details. Logistics is the part nobody cares to talk about — it's not romantic to discuss how you make sure everything is in the right place at the right time — but that's the reason the wedding actually happens. And that part, basically, LLMs are not going to be good at. If you want more of this, I suggest these references.

So the positive news for LLMs in planning is that they actually can be useful in planning. They just have to be used as generators, with external verifiers checking whether the plan is correct and back-prompting them. They can't plan by themselves, but they can support planning in this LLM-Modulo fashion: used in conjunction with external verifiers and solvers, which themselves can be generated with partial human help from the LLMs.
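Schematically, this generate-test-critique loop can be sketched as below. `call_llm`, the critic, and the toy plans are made-up stand-ins for illustration; a real instantiation would plug in an actual model and formal verifiers.

```python
# A minimal sketch of the LLM-Modulo loop: the LLM proposes, sound external
# critics verify, and their complaints are fed back as the next prompt.

def llm_modulo(problem, call_llm, critics, max_iters=15):
    prompt = problem
    for _ in range(max_iters):
        plan = call_llm(prompt)
        complaints = [msg for critic in critics
                      for ok, msg in [critic(problem, plan)] if not ok]
        if not complaints:
            return plan  # certified by every critic, hence guaranteed
        # Back-prompt: the original problem plus what was wrong last time.
        prompt = (problem + "\nPrevious attempt failed because:\n"
                  + "\n".join(complaints))
    return None  # no certified plan within the iteration budget

# Toy run with a fake "LLM" that gets it right after one complaint:
guesses = iter([["stack A on B"], ["pickup A", "stack A on B"]])
fake_llm = lambda prompt: next(guesses)
critic = lambda prob, plan: (plan[0] == "pickup A", "must pick up A first")
print(llm_modulo("put A on B", fake_llm, [critic]))
```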
So you can actually tease these apart. You can write the code for the verifiers, with humans checking that the verifier code is correct; once that's checked, it serves as an external signal. The architecture is essentially this: you have a bank of critics looking at the LLM-generated plan and saying what's wrong with it from the resources point of view, what's wrong with it from the time-constraints point of view, et cetera. All of that is given back to the LLM as the prompt, and it comes up with a new candidate solution. And to the extent LLMs pay reasonable attention to the prompt, they can actually improve. In the simplest case you have just one critic: for blocks world, you have a single blocks-world plan verifier, and the LLM just generates plans. On the same examples I was giving, it goes from roughly the 30% that I showed for GPT-4 to 82% in 15 iterations — up to 15 times, the LLM generates a plan and the verifier says "try again" until it's right, and within 15 rounds you have solved 82% of the problems. Way more importantly, when you say you have solved the problem, the solution is guaranteed to be correct with respect to the model, because the critics have robustly certified it.

So there are different kinds of critics. There's an ICML position paper that we wrote, as well as individual papers on these studies, that you can look at if you're interested. In particular, there are correctness critiques versus style critiques, and one of the interesting things is that style critiquing can be done by LLMs. Even though I said LLMs cannot self-critique, what they cannot self-critique is correctness; they can self-critique style, because for style there is no formal way of checking anyway. That's the reason we always assumed style is so hard: style is just learned, and LLMs are as good a judge of it as anything else. There was a COLM paper of ours — at the Conference on Language Modeling that just happened last month — where we use a VLM to critique videos of robot behavior, not for correctness but in terms of style: whether humans would be happy with the kind of behavior it's showing. And many of the criticisms it gives are pretty reasonable — for example, when you're handing over a pair of scissors, don't point the sharp end toward the human. That's the stuff of common sense, and LLMs are actually very good at common sense. Interestingly, AI always had trouble with common sense, but LLMs are actually a partial answer to that. We can also have them be human-preference proxies: there's an HRI paper in which the robot asks itself, would this sub-plan, this behavior, spook the human? Which is actually connected to the first part of my talk — I used to do human-AI interaction, so we took explicability, predictability, et cetera, and checked whether LLMs can reasonably guess them, and they're not too bad at it. We also used this on a travel-planning benchmark — somebody else, from Ohio State, built it — and just by applying the external verifier, you improve the performance quite significantly.
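For concreteness, here is a minimal sketch of what one such hand-checkable correctness critic could look like, assuming a STRIPS-style action model. The encoding below is invented for illustration; in the actual studies a formal plan validator (such as VAL) plays this role.

```python
def validate(plan, actions, state, goal):
    """Return (ok, reason). Each action: (preconds, adds, deletes) as sets."""
    state = set(state)
    for step in plan:
        pre, add, dele = actions[step]
        missing = pre - state
        if missing:
            return False, f"{step}: preconditions not met: {missing}"
        state = (state - dele) | add
    unmet = set(goal) - state
    return (not unmet), (f"goal unmet: {unmet}" if unmet else "valid plan")

# Two-block toy instance: put A on B.
actions = {
    "pickup A": ({"clear A", "ontable A", "handempty"},
                 {"holding A"},
                 {"clear A", "ontable A", "handempty"}),
    "stack A on B": ({"holding A", "clear B"},
                     {"on A B", "clear A", "handempty"},
                     {"holding A", "clear B"}),
}
state = {"ontable A", "ontable B", "clear A", "clear B", "handempty"}
ok, reason = validate(["pickup A", "stack A on B"], actions, state,
                      {"on A B"})
print(ok, reason)  # a checker like this is what the loop above would wrap
```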
Similarly, we did this for Google's Natural Plan benchmark, which includes a scheduling task, and once again we can improve performance quite significantly with this LLM-Modulo setup, with the verifier back-prompting the LLM. And so I would argue that LLM-Modulo unifies all the sane uses of LLMs in reasoning problems. For those of you who know AlphaGeometry, AlphaProof, et cetera: they can actually be seen as instances of LLM-Modulo. The program-synthesis community actually realizes this — if you think about code generation, they use a Python interpreter with unit tests to check whether the generated code is likely to be correct, and that's exactly this kind of external signal. You can even use LLMs to generate parts of the domain models and the verifiers — there's a new paper that shows how to do that — so the critics themselves can be generated with the help of LLMs. That's the summary of LLM-Modulo.

I can stop here, or maybe I'll tell you one more thing: we do have some ideas about how Strawberry — o1 — works. There are two very recent papers on this. The important point, as I mentioned, is that the right way to understand o1 — for those of you who are interested, you should read those papers — is in terms of AlphaZero and MuZero: reinforcement-learning systems that learn the Q-values of actions and then try to improve those estimates at inference time with extra online MCTS. For the LLM version, the "actions" are the prompt sequences it generates with a secondary LLM, and it's trying to figure out which particular prompt to issue to get the right answer on a synthetic-data instance — the synthetic data being a bunch of planning problems with correct solutions obtained from an external planner. During inference, it then does something like this MCTS-style improvement of its correctness estimates, and that's why it takes so long at inference time.

For those of you who know how API costs work: normally you're charged dollars in proportion to the number of input tokens plus roughly four times the number of output tokens. What o1 does is: input tokens, plus four times the output tokens, plus four times the number of internal reasoning tokens — which they're too sensitive to show you; you can't see what they are, but you pay for them. That internal number basically corresponds to the extra AlphaZero/MCTS-style machinery, and that's where the cost kills you: it's how we wound up paying $6,000 just to check whether o1 can actually do planning. So those are my speculations on o1 that you can look up. It does improve planning performance — still with no guarantees — and furthermore, its costs are significantly higher. You can apply the same LLM-Modulo idea on top of these large reasoning models too and get the improvement with guarantees — you can have it both ways; in fact, sometimes LLM-Modulo can even allow you to use a cheaper LLM in place of o1.
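As a back-of-the-envelope illustration of that billing structure — the rates and token counts below are made up; only the shape of the formula comes from the talk:

```python
# Hypothetical pricing sketch: output tokens (and, for o1-style models,
# hidden reasoning tokens) are billed at ~4x the input-token rate.

def api_cost(in_tok: int, out_tok: int, hidden_tok: int = 0,
             rate_per_token: float = 1e-6) -> float:
    return rate_per_token * (in_tok + 4 * (out_tok + hidden_tok))

print(api_cost(2_000, 500))           # plain LLM call
print(api_cost(2_000, 500, 30_000))   # same query, o1-style hidden search
```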
The last thing I want to show you — there's a long thread about this on my Twitter that you can look at — is stone soup. How many of you know stone soup? It's a European fable: travelers come to a town, and they're hungry, they need food. So they sit down, put a big pot over a fire with some water and a big stone in it, and start heating the water. People come by and ask, what are you doing? We are making soup. How are you making soup? We're making it with the stone. Wow, you can make soup with a stone? That's very impressive. Yes, we can make soup with a stone, but we need a couple of other things — like carrots. Somebody brings carrots. Potatoes — some people bring potatoes. Spices — go get some spices, and someone does. They put all this stuff in, and there is soup. The question is: who made the soup? Was it the stone? By the way, unless the stone leaches some bad minerals, stone soup would be just as tasty as normal soup without the stone — so you can't say it isn't soup. The question is how much credit the stone gets. Likewise, if you, for example, add an RL component on top of an LLM, or if you add an external verifier, you can get guaranteed reasoning performance out of the system — but that's no longer just the LLM. If you add reasoning to LLMs, the combination will do reasoning. On the other hand, if you merely beseech the LLM with a prompt to please reason, it doesn't. So that's the lesson of the stone soup. Thank you.