
The ThinkND Podcast
Soc(AI)ety Seminars, Part 7: AI at the Molecular Level
Explore the expanding role of artificial intelligence (AI) in scientific discovery, with a focus on early-stage molecular design and synthesis. Get an overview of recent advancements in AI applications that are shaping how we approach molecular discovery, highlighting both their transformative potential and the important considerations around dual-use implications.
Join Soc(AI)ety Seminars as we host Connor Coley, the Class of 1957 Career Development Professor and Associate Professor in the Department of Electrical Engineering and Computer Science at MIT.
Thanks for listening! The ThinkND Podcast is brought to you by ThinkND, the University of Notre Dame's online learning community. We connect you with videos, podcasts, articles, courses, and other resources to inspire minds and spark conversations on topics that matter to you — everything from faith and politics, to science, technology, and your career.
- Learn more about ThinkND and register for upcoming live events at think.nd.edu.
- Join our LinkedIn community for updates, episode clips, and more.
I'm glad to be here to share some of the work we've been doing at MIT related to AI for chemistry. We think a lot about this intersection because my group sits at the interface between chemical engineering and electrical engineering and computer science, as part of MIT's Schwarzman College of Computing. I'm going to talk about a couple of threads and themes that we work on in the group, and in particular try to highlight not just the technology, but also what we're trying to do with it, along with some side comments on the potential for dual use, because I think those topics are of interest to many folks in the room.

Obviously, we're at a moment where we're seeing increasing use of AI across the chemical sciences, and across science in general. There are high-profile examples like AlphaFold, of course, receiving a Nobel Prize in chemistry, and the everyday usage of tools like large language models. But you also have AI being used across the sciences in narrower areas that you wouldn't normally think of: weather forecasting, where it's used to solve very large PDE systems and simulate climate and weather events; controlling reactors and instruments; and of course lots of applications in biology and biomedicine. We think about its intersection with chemistry, as you'll see.

What unifies the applications across the sciences, and how we think about AI broadly, is its role in the scientific process. We typically think about using AI to help us make sense of the world: analyze data, interpret the results of experiments, and hopefully come up with new hypotheses and new ideas that drive future experimentation and future discoveries.

So in chemistry, what is it we're trying to do? We're trying to drive scientific discovery, and what that means for us is a few things. We're trying to find new physical matter; that could be a new molecule, a therapeutic, or a new material that goes into a solar cell. We could also be trying to discover a process; the result of a discovery campaign could be a manufacturing process or a synthetic route. A third output is the discovery of models. There are very empirical models, like trained neural networks, which perhaps count as discovering a model that describes the relationship between structure and function. But models can also be more abstract and higher level, and while I'm not going to talk about it, and there are very few examples in the literature, it's interesting to think about the ways AI could help us invent those more abstract, fundamental models: things like the periodic table, or orbitals, discovered from data. We're going to be a bit more down to earth in this talk, thinking about the discovery of molecules and the synthetic processes to make them.

Small molecules are a really nice modality. They make up the majority of our FDA-approved medicines. We also use them in agriculture, in materials, in defense. They have this really interesting, broad set of applications just by rearranging the same familiar set of elements: carbon, hydrogen, nitrogen, and oxygen get you most of the way to covering a lot of the applications on the slide.
And what's interesting is that just by rearranging these same kinds of atoms, you can access such a broad range of function. That's really because chemical space is huge. If you think about all the possible ways you could arrange these atoms, there's an enormous number of possible chemical structures, and people estimate this number in various ways: 10 to the 30, 10 to the 60. At that point the number becomes meaningless, because it's virtually infinite and we can't exhaustively search this space, so of course we have to think about ways to accelerate the process. I see this enormity as an opportunity, because in principle there exist chemicals within this space that achieve a broad range of functions useful for medicines and more; but of course it's also a hard search problem.

It's a search problem that we tend to approach with a classic iterative workflow. We don't think we can anticipate exactly what molecule to make without some experimental testing and some iterative feedback, so we end up in an empirical loop. We might design some molecules that try to satisfy some goal we've set out; maybe that goal is a set of properties we want to achieve. After designing those structures, we need to worry about how to synthesize them. Unlike many applications of AI where you design images or text, the things we design need to exist in the physical world, so we have to physically instantiate the molecules we come up with in order to test their properties experimentally. That comes into play in the validation step, and that validation data feeds back into our models. Those models then, presumably, help us design what to make next.

So we like to think about lots of different opportunities to accelerate this process and to bring computation in to support all the complicated decisions we have to make: what we want to make, how we want to make it, why we've made it, and what we do with new data once we get it. Walking around the cycle, starting at the bottom, we can think about the piecemeal use of artificial intelligence to help answer questions like: what are the properties of this molecular structure I've just hypothesized? We can think about the design question of directly saying what molecules we should be trying to make or test. We can think about models that come up with synthesis plans, the recipes for how we might produce those structures. Overall, the idea is that this accelerates our navigation through the cycle. A lot of the projects in my research group at MIT touch on different aspects of this workflow and different opportunities to speed up the process, and this will frame the rest of the talk: walking through the cycle. The shape of that loop is sketched in code just below.
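To make the shape of that cycle concrete, here is a minimal, purely illustrative Python sketch of the design-make-test loop. Every function and the scoring rule here are hypothetical stand-ins rather than any real chemistry API; the point is only how design, validation, and model feedback wire together.

```python
import random

# Toy "library" of candidate molecules, written as SMILES strings.
CANDIDATE_LIBRARY = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O", "CCOC"]

def run_experiment(smiles: str) -> float:
    # Stand-in for physically synthesizing and testing the molecule.
    return -abs(len(smiles) - 5) + random.gauss(0, 0.1)

def predict(measured: dict, smiles: str) -> float:
    # Crude "model": average measurement over molecules of similar size.
    similar = [v for s, v in measured.items() if abs(len(s) - len(smiles)) <= 1]
    return sum(similar) / len(similar) if similar else 0.0

def design(measured: dict, batch: int = 2) -> list[str]:
    # "Design" = pick the untested candidates the current model likes best.
    untested = [s for s in CANDIDATE_LIBRARY if s not in measured]
    return sorted(untested, key=lambda s: predict(measured, s), reverse=True)[:batch]

measured: dict[str, float] = {}
for round_number in range(3):                    # the iterative design-make-test loop
    for smiles in design(measured):
        measured[smiles] = run_experiment(smiles)   # validation data feeds back in
print("best so far:", max(measured, key=measured.get))
```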
Starting first with brief comments about the modeling side: how we think about using structure-property relationships to help us know what molecules are worth making. This goal of relating molecular structure to function is not new; we've been trying to do this for decades and decades. It was arguably formalized in the 1960s, and the idea, then and now, is that we want to correlate aspects of a molecule's structure with some property. That property could be a physical property, like solubility; a biological property, like potency or binding to a specific protein target; or another chemical property, like reactivity. There's any number of things we could try to predict about a molecular structure.

The relevant change that's happened in the past decade, accelerated by machine learning and deep learning in particular, is a transition from a paradigm where we would represent our molecule through a set of expert-defined features, a fixed numerical vector describing aspects of the structure, to a paradigm where we learn the representation directly from the original molecular structure, perhaps represented as a graph. As a field, we've always tried to be very clever in the second step, the learning step: how do we take a vector description of a molecule and predict a physical property? We've applied machine learning to this for decades. The change since about 2015 is that now we use things like graph neural networks to take a graph structure, learn a numerical embedding, and regress that, end to end, against properties of interest. Out of MIT, in collaboration with many others, we developed a toolkit called Chemprop. It continues the development of graph neural networks, and it's a particularly convenient off-the-shelf solution, with a good set of hyperparameters, for the message-passing operation that is emblematic of graph neural networks. It's just a convenient way to say: I've got a dataset of chemical structures and properties, and I would like to learn that relationship.

I'm not going to say too much about this class of models, because property prediction models are ubiquitous; you can't avoid them when you think about molecular design. I will comment that there's always room for improvement, and we think about several directions. We'd obviously love our models to use less data: to be able to say, here are three structures and maybe their biological activities, now generalize to 10 to the 60 candidates. We try to build techniques that quantify their own uncertainty, so that models which haven't seen that many unique chemicals can at least tell us when they're confident or unconfident in the predictions we ask them to make. We're still in a state where these models are imperfect; we can't generalize perfectly to structures we haven't seen before. So, importantly, as the bottom-left sentence points out, we're almost always going to rely on some degree of experimental testing and feedback. You can't really escape that iterative design cycle: at the end of the day, we need to test compounds in the lab and use that information to iterate. A schematic sketch of the message-passing idea is below.
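As a gesture at what this looks like in code, here is a schematic message-passing network over a molecular graph: a minimal sketch in the spirit of graph neural networks like Chemprop, not Chemprop's actual implementation. It assumes RDKit and PyTorch, and the single-feature atom encoding (atomic number only) is a deliberate simplification.

```python
import torch
import torch.nn as nn
from rdkit import Chem

def mol_to_graph(smiles: str):
    """Atoms become nodes (featurized by atomic number); bonds become edges."""
    mol = Chem.MolFromSmiles(smiles)
    x = torch.tensor([[a.GetAtomicNum()] for a in mol.GetAtoms()], dtype=torch.float)
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    edges += [(j, i) for i, j in edges]              # undirected: both directions
    edge_index = torch.tensor(edges, dtype=torch.long).t()
    return x, edge_index

class SimpleMPNN(nn.Module):
    def __init__(self, hidden: int = 64, steps: int = 3):
        super().__init__()
        self.embed = nn.Linear(1, hidden)
        self.msg = nn.Linear(hidden, hidden)
        self.readout = nn.Linear(hidden, 1)
        self.steps = steps

    def forward(self, x, edge_index):
        h = torch.relu(self.embed(x))
        src, dst = edge_index
        for _ in range(self.steps):                  # rounds of message passing
            messages = self.msg(h)[src]              # one message per directed edge
            agg = torch.zeros_like(h).index_add_(0, dst, messages)
            h = torch.relu(h + agg)                  # update node states
        return self.readout(h.sum(dim=0))            # pool to a molecule-level property

x, ei = mol_to_graph("CCO")
print(SimpleMPNN()(x, ei))   # untrained prediction; training would regress this to data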
That brings us to the next step in the process: if we have models that can predict a structure-property relationship, how do we approach the design or selection of structures to take into the lab and test in this loop? We have this hypothetical relationship, which we approximate with machine learning models, that tells us how structure relates to function, and we can draw it as a cartoon surface: as you change the structure along the x and y axes, the property changes from good to bad. The idea of design is that we'd love to find the molecules that live at the maximum of this function. We'd really like to find the structures that achieve that maximal binding potency, or maybe that maximal quantum yield if we're doing some optoelectronic application.

So one thing we can do with our structure-property model is screen vast chemical spaces. We could say: here's a list of candidate molecules. I can go online right now and download a list of 2.7 trillion molecules that a chemical vendor is willing to sell me, and I could ask which of those 2.7 trillion molecules has the best predicted property. That can be the basis for how I then drive my set of experiments. Virtual screening just says: for this candidate, here's the prediction; for this candidate, here's the prediction; and so on. That can work, and it's been the dominant paradigm. But there's been a lot of interest, of course, in using generative techniques instead, and thinking about the role of generative AI and generative models to propose novel chemical structures that don't belong to that fixed virtual library. The promise, or the hope, is that if we have some understanding of a structure-property relationship, we can try to invert that relationship and go from the desired property directly to the structures we want. The goal would be a generative model that proposes, directly, the chemical structures that live at those maxima, or maybe takes a nice, smooth walk up the hill to find those optimal molecules.

I want to talk first about the virtual screening side, because it has really changed the way people approach molecular design in pharmaceutical applications. We've reached a point where chemical vendors list billions and trillions of molecules they're willing to sell you, which is different from how things were five or ten years ago. These chemical vendors, and companies as well, build what we call make-on-demand libraries. A make-on-demand library basically says: I've got some finite number of molecules in my laboratory that I have access to, and I have some set of robust chemical transformations that I think I can run. I think I can take an acid and an amine, make an amide bond, and get that amide product. You define a small set of transformation tools you believe you can run, you enumerate all the products you think you can make in one or two or three synthetic steps, and this maps out billions or trillions or more of hypothetical structures that the vendor believes it can access and is willing to sell you. A toy version of that enumeration follows.
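Here is a toy illustration of that enumeration using RDKit, with a single transformation, amide coupling of an acid and an amine, applied combinatorially to tiny building-block lists. Vendors do the same thing with hundreds of curated reactions and millions of in-stock building blocks, which is how the counts reach billions; the SMARTS pattern here is a simplified, illustrative one.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# One "robust transformation": carboxylic acid + primary/secondary amine -> amide.
amide_coupling = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OH].[NX3;H2,H1:3]>>[C:1](=[O:2])[N:3]"
)

acids = [Chem.MolFromSmiles(s) for s in ["CC(=O)O", "c1ccccc1C(=O)O"]]
amines = [Chem.MolFromSmiles(s) for s in ["CCN", "NCc1ccccc1"]]

products = set()
for acid in acids:
    for amine in amines:                       # combinatorial pairing of building blocks
        for (prod,) in amide_coupling.RunReactants((acid, amine)):
            Chem.SanitizeMol(prod)
            products.add(Chem.MolToSmiles(prod))

print(len(products), "enumerated products:", sorted(products))
```

With 2 acids and 2 amines you get 4 products; with 10,000 of each and a few hundred reactions, the library explodes, which is exactly the make-on-demand story.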
Now, navigating these large virtual spaces is not trivial, because you have billions or trillions of candidates, so we think about ways to do discrete optimization, or Bayesian optimization, in this discrete space to accelerate the search. With these large libraries, one of the ways we screen them, with application to drug discovery, is using basic structure-based drug design methods: really coarse tools like docking that try to approximate the binding affinity between a small molecule and a protein of interest. They're really not very good, but they're an okay first filter that can be applied at large scale. Typically we would take our large set of candidates and exhaustively screen them: for these 100 million molecules, what are the 100 million predicted binding affinities? Then you sort your list to find the best candidates and maybe take those into the laboratory.

But you can insert a simple model into this workflow. You can say: I don't want to do 100 million simulations. I'll do a few, maybe tens of thousands, and train a surrogate machine learning model. That machine learning model won't be predicting the results of an experiment; it will be predicting the results of a simulation. We skip the simulation and use the model to tell us where in chemical space we should look and where it's not worth looking. We can build a lot of different tools on top of this: we can use our imperfect uncertainty quantification to eliminate regions of chemical space and say it's not worth searching this billion compounds over here, because they all look bad. We use models to steer within this gigantic search space of molecular structures. You can extend this quite easily to multi-objective optimization, so we can do Pareto optimization if you have multiple properties of interest, as we often do in molecular design. And recently we've been working on different acquisition functions, in this Bayesian optimization context, that try to mitigate the risk that you do this virtual screen, you pick molecules to test, and they all fail. That can happen: you use imperfect simulations to select molecules to test, and they might all fail. So we think about acquisition functions that diversify the selection on the basis of model uncertainty; I'm happy to talk more about that later if folks are interested. A minimal sketch of this surrogate-assisted screening loop is below.
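Here is a minimal sketch of surrogate-assisted screening, assuming RDKit and scikit-learn. The docking oracle is a fake placeholder so the example stays self-contained; the per-tree spread of a random forest stands in for proper uncertainty quantification, and the mean-plus-spread ranking is a crude stand-in for a real acquisition function.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

library = ["CCO", "CCCN", "c1ccccc1O", "CC(=O)Nc1ccccc1", "CCOC(C)=O", "c1ccncc1"]

def featurize(smiles: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=512)
    return np.array(list(fp), dtype=np.uint8)

def expensive_oracle(smiles: str) -> float:
    # Placeholder for a docking simulation; higher = better in this toy.
    return float(len(smiles))

X = np.stack([featurize(s) for s in library])
rng = np.random.default_rng(0)
sampled = rng.choice(len(library), size=3, replace=False)   # run only a few "simulations"
y = np.array([expensive_oracle(library[i]) for i in sampled])

surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[sampled], y)

# Per-tree spread as a crude uncertainty; rank by optimistic mean + spread.
per_tree = np.stack([tree.predict(X) for tree in surrogate.estimators_])
score = per_tree.mean(axis=0) + per_tree.std(axis=0)
untested = [i for i in np.argsort(-score) if i not in set(sampled)]
print("screen next:", [library[i] for i in untested[:2]])
```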
So that's virtual screening, and of course I mentioned generative models being of high interest; they're of interest to us as well. Again, the promise is that we can use models to invent new molecules with better and better properties; maybe it's a more potent drug. The way we typically use generative models is that we have the generator itself and some evaluator or feedback mechanism that takes the idea of a molecule and scores it: that looks good, that looks bad. Often it's a property prediction model, so we still rely on that feedback mechanism; it could be an experiment, a machine learning model, or a simulation. Given some feedback mechanism, we can build a number of different generative architectures to invent new structures. This animation shows one model that thinks in terms of graphs, building up, atom by atom, a molecule that a separate function predicts to potentially be a protein degrader. Is that a good property predictor? Potentially. But given that property prediction, this generative model is inventing structures it believes to be good. This is fun, I think it's fun at least, but the problem is that although we want these models to be creative and invent new structures likely to have unprecedented properties, sometimes they're a little bit too creative. The kinds of molecules you can get from these models sometimes look quite absurd if you're a chemist.

You end up with motifs that just don't look reasonable: they're not stable, they're not accessible, they really don't make any sense. For some reason, this one particular model loves putting boron in aromatic rings. You get this very catastrophic failure mode of generative models, where they propose structures that they earnestly believe to be good, and maybe your property predictor, because it's imperfect, says this looks good to me, but it's not something you would want to give to a chemist or a colleague to try to synthesize. This really is one of the big failure modes of generative models, and we've worked a lot on addressing it over the past few years. It raises this really important question of chemical synthesis, because the reason these are bad suggestions is not inherently that boron is in a funny place; it's that there's no way for me to take the idea into the physical lab and test it. Even if the model thinks it's correct, there's just no way for me to make it.

That brings up the question of synthesis, which is a major focus of my research group at MIT. Synthesis is an obstacle to accessing molecules in the physical world; it can be really hard for some structures and easy for others. Synthesis planning is the process of taking a target and identifying some plausible series of reaction steps that can access it. Given the target molecule in the bottom right, I might want to devise a recipe that tells me: here are the building blocks to purchase, the ingredients; here are the conditions to use for the reactions; and you build up complexity bit by bit. In this case it turns out to be a linear pathway to the final structure.

Humans are really good at this. Humans learn how to do it in organic chemistry classes, and organic chemistry exams classically have you do this kind of retrosynthetic planning, where you start with the target product and reason your way backwards. You typically start from the end point and apply this reverse logic: how can I make this molecule? How can I simplify it? How can I simplify it further, until I reach starting materials I can purchase? This is the basic notion of retrosynthetic analysis. It's something we can teach computers to do as well, and people have been doing so since the 1960s.

Before we get there, some comments on the role of AI for synthesis. I view there as being several different ways we want to use AI to help chemical synthesis. The foundation typically is data. It's trite, but it's trite for a reason: we want to catalog and understand what has been done before, what reactions have been run, who ran them, and what the outcomes were. This layer of data is something we want to make searchable, and chemical publishers have thankfully made it searchable over the years. On top of that, we want to layer the next stage of reasoning: training virtual lab assistants, tools that gain insights from that data and make recommendations to us. They tell us what kinds of things we should try; if we run a reaction and the yield is bad, we can ask for feedback: why was this bad? What should I do next?
So on top of this data layer, we can have an assistant layer, like a copilot for the chemistry lab. Eventually, we would love to have models that understand the nuance of chemistry and can run autonomous laboratories, make their own decisions, troubleshoot their own problems, and just give us molecules with zero intervention. That's the dream of autonomous chemistry in the future. We're of course not quite there, but that's what a lot of folks in the field are trying to build toward. We're going to stay on the lower levels, though, and think about the data and the inferences we build on top of it.

Like I mentioned, publishers have been tabulating data on chemical reactions for quite some time. Here's an example entry from a reaction database: it tells us who ran the reaction, what the reactants and starting materials were, what the products were, what the conditions were, and what the yields were. This gives us a lot of information from which to learn useful patterns of chemical reactivity. So we have this nice example of saying: if we want to make that product, here's a set of reactions we could use to make it; and if we want to run this reaction, here's a set of conditions we might be able to use. We can build models on top of this despite the fact that there's a lot of missing information and confounding variables we don't have. There's a lot of nuance about chemistry that matters but is simply missed by these databases; it wasn't tabulated, it wasn't reported. So there are some things we can't do with this information.

We can still train models to do things like retrosynthesis using this information source. Retrosynthesis means starting with the target and working your way backwards. What this requires is a model that can take a product and recommend reactants that could produce it in one synthetic step. You can't make a lot of super interesting things in one step, so we need a way of doing this multiple times; it's a recursive search process. As soon as you start talking about recursive search and expansion, you start thinking about combinatorial explosion, so you need tree search strategies to manage it, because as you go further and further back, you have more and more options. We use tree search algorithms like Monte Carlo tree search and A* search, your classics from computer science, to try to map out this landscape of options. Of course, we also need to know when to stop, so we need a database that says: you've found something commercially available, now you can stop. There's a lot of research being done in the field trying to improve our ability to map out these hypothetical chemical reactions and do this multi-step search process. A skeletal version of that search is below.
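Here is a skeletal, hedged version of that recursive search. The one-step "model" is just a hard-coded lookup table standing in for a learned single-step retrosynthesis model, and the priority function is simply the number of unsolved molecules; real systems use learned models plus Monte Carlo tree search or A* with better heuristics.

```python
import heapq

STOCK = {"CCO", "CC(=O)O", "CCN"}                 # purchasable building blocks
ONE_STEP = {                                      # product -> candidate reactant sets
    "CCOC(C)=O": [("CC(=O)O", "CCO")],            # ester from acid + alcohol
    "CCNC(C)=O": [("CC(=O)O", "CCN")],            # amide from acid + amine
}

def plan(target: str, max_depth: int = 5):
    """Return a list of reaction steps ending in purchasable materials, or None."""
    # Frontier entries: (number of unsolved molecules, unsolved list, route so far).
    frontier = [(1, [target], [])]
    while frontier:
        _, unsolved, route = heapq.heappop(frontier)
        unsolved = [m for m in unsolved if m not in STOCK]
        if not unsolved:                          # every leaf is purchasable: done
            return route
        mol, rest = unsolved[0], unsolved[1:]
        for reactants in ONE_STEP.get(mol, []):   # expand with the one-step model
            new_route = route + [f"{' + '.join(reactants)} -> {mol}"]
            if len(new_route) <= max_depth:
                new_unsolved = rest + list(reactants)
                heapq.heappush(frontier, (len(new_unsolved), new_unsolved, new_route))
    return None                                   # search exhausted without a route

print(plan("CCOC(C)=O"))
```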
When we apply these tools to relatively simple compounds, we often get a lot of nice recommendations. We can say: I'd like to make this molecule. We query one of these recommender models, and we can sometimes get hundreds or thousands of different ideas for how to synthesize the structure. Even if the models have never seen this molecular structure in their training data, they look at it, analyze it, understand what kinds of disconnections can be made, and map it back to commercially available starting materials. So for these simpler structures, we can recommend lots of ideas for how we might synthesize them. If you put in something very complex, perhaps we don't get good recommendations, or any recommendations; there are limitations, of course, to these methods.

What we've been trying to do over the years, as we develop the algorithms that make these predictions and learn about chemical reactions, is to develop, in parallel, software that is usable and accessible for synthetic chemists. If we build a nice command-line tool, that's not actually terribly useful for practicing synthetic chemists, because most of them do not want to use command-line tools. So we've developed graphical interfaces for these kinds of software tools, deployed them, open-sourced them, and gotten pretty good adoption from the community. There are a few other open-source tools out there now; at the time I started working on these directions as a student, there was really nothing in the open-source, pre-competitive space for planning synthetic pathways. I think we're now seeing a really nice ecosystem, both open source and commercial, around these kinds of capabilities.

Now, the reality of synthesis is a little more complex. If I want to actually make a molecule, it's not enough to have the basic design of a pathway: here are the building blocks, here are the basic conditions. You have to make a lot of very specific decisions when you want to physically run a reaction in the lab. Beyond the abstract plans that our synthesis planning tools can produce, we have to decide details like orders of addition, rates of addition, and purification or isolation steps. At some point we have to decide what hardware to use as well. The databases didn't tell us whether to use a round-bottom flask or a vial or a well plate; we have to make some decision about the physical hardware. This, of course, has implications for the automated laboratories we'll talk about a little later.

Overall, we're at a nice point in the field where we have tools that can plan synthetic routes to many molecules on the simpler side, and they can give you suggestions for how to produce them quite quickly, within this nice open-source ecosystem. We've seen people use these tools for the classic intended use case: here's a molecule I want to make, come up with a recipe. That's what they're designed for. But the tools can also be used for adjacent tasks, and there have been some really interesting applications in the literature. People have applied these kinds of tools to shore up supply chains: if you have one manufacturing pathway to a drug and there's a critical shortage of a starting material, you might be able to quickly propose an alternative synthetic pathway to avoid that shortage. Some companies are thinking about how to onshore building blocks for our pharmaceutical supply chain; right now, even if you were to synthesize medicines here,
the government is quite nervous about the fact that the starting materials typically come from overseas, so there's a lot of interest from the government in having domestically sourced starting materials. People have also been using some of our other models for synthesis to anticipate transformation products, for things like analyzing the safety of additives in vaping fluid: recognizing that, yes, there's a ton of really bad stuff you can make, enumerating the possible structures, and cross-referencing that with other data to identify the molecules that might be byproducts of these processes.

Now, the dual-use world; I promised there would be some comments on this. You can imagine that if you have a tool that can tell you how to make any given molecule, you can put in molecules that we don't really want people to make. You can imagine people making hazardous chemicals, or learning how to make regulated substances. The capability that lets you do supply-chain robustness, avoiding a specific starting material that's in shortage, could also be used to avoid a specific starting material that's a controlled substance. So there's a little bit of concern about how people might use these tools to circumvent the way we regulate starting materials. It's not all bad, though; there are mitigating factors to these concerns. One, of course, is that if you're making a new molecule that's quite challenging, or following a new synthetic route, there's further development and expertise typically required to actually make it work experimentally. It's not that you push a button and get a molecule out; there's additional chemistry expertise involved. Part of the mitigation as well is an interesting discussion around how much these tools actually change the ease with which people can access controlled substances, because it turns out there are a lot of known synthetic routes to known illegal molecules. Perhaps these tools don't fundamentally change that landscape, but it's an important consideration: the way access to these programs might influence how people approach regulating controlled substances.

I want to pivot back now to the design question, but still thinking about synthesis in design, because this is one of the major opportunities to improve the way we use AI for chemistry. I alluded to the failure mode: when you design molecules with generative techniques, you can sometimes get suggestions that are quite absurd, that you can't take into the lab and make. The solution we've been pursuing is to redefine what we ask the generative models to do. When we use AI to design molecules, the typical way we design them is in the language of SMILES strings, so we encode molecules as strings. Maybe we write them as graphs; we like graphs. Maybe we write them as three-dimensional point clouds. So we use things like language models to propose strings, graph-generating models to propose graphs, or 3D point cloud models to propose point clouds. The point is that when you design in the space of structure, designing a molecular structure does not mean it's going to be easy to access. The sketch below makes that gap concrete.
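To make the string-space point concrete: the sketch below samples random SMILES-like strings from a toy alphabet and keeps the ones RDKit can parse. A real generator is a trained language model, not this random sampler, but the gap it exposes is the same one described here: a string can parse to a perfectly valid molecule and still be unstable or unmakeable.

```python
import random
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.error")        # silence parse warnings for the demo

VOCAB = list("CNO()=c1")                  # toy alphabet of SMILES characters

def sample_string(max_len: int = 8) -> str:
    # Stand-in for a trained SMILES language model: pure random characters.
    return "".join(random.choice(VOCAB) for _ in range(random.randint(1, max_len)))

random.seed(0)
samples = [sample_string() for _ in range(500)]
valid = [s for s in samples if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(samples)} strings parse as molecules; e.g.", valid[:5])
# Parsing is only structural validity: it says nothing about stability or
# synthesizability, which is exactly why synthesis-constrained design follows.
```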
And so our solution is to ask the models to design not just a set of structures, but pathways, synthetic processes. We build generative models, and other types of enumerative models, that think about designing molecular structures via the generation of a synthetic plan. We force the models to say: here are the building blocks I'm going to use; here are the reaction transformations I'm going to use. We don't just let them move atoms around arbitrarily; we constrain them to think in terms of chemical transformations that we believe we can run in the laboratory.

The simplest way to do this is to take a synthetic process, a reaction scheme. In this case, it's a coupling of a sulfonyl chloride and an amine to make a candidate that a collaborator is working on for tuberculosis. We could say: if I want to guarantee the synthetic accessibility of the molecules I'm designing, why don't I just take a whole bunch of sulfonyl chlorides and a whole bunch of amines and enumerate all the combinations between them? These are the basic enumerative approaches that chemical vendors use: I'm not trying to change the synthesis, I'm just changing the building blocks, and here are the molecules I think I can access that look similar to this structure.

Now, this has its own implications for dual-use concerns, and I think the ease of accessing novel chemical structures is underappreciated. We did an analysis some time ago looking at this basic enumeration: how easy is it to generate analogs? It is very easy to generate analogs of structures. You can read open-access scientific publications on synthetic pathways to molecules; in this case, fentanyl, which has a very straightforward chemical synthesis in the grand scheme of chemical synthesis. And you could say: even if I don't want to invent any new chemistry, what are the analogs, the things I could try to access? You can enumerate billions of hypothetical analogs that are commercially available, or whose building blocks are commercially available, that would require no reinvention of the chemistry. This is a little concerning: it's easy chemistry, we're generating novel structures without changing the reactions we're running, and we find we can access many molecules highly similar to a controlled substance simply by swapping out the building blocks typically used to make it. One of the realities of molecular design is that it's easy to access large numbers of molecules through relatively simple chemical transformations.

We've worked on a number of different approaches to molecular design that use this notion of combining building blocks and reaction transformations. One of them is shown in this cartoon, a recent model called SynFormer. The idea is that we want to teach the model about chemistry; we want to constrain its idea generation using chemistry. We give the model access to commercially available building blocks, things we can purchase, things that are in stock, and we give it access to transformation patterns that we have defined and curated. Then we let the model generate hypothetical molecules by generating hypothetical synthetic pathways.
The way we use this model is that we first train it on known molecules with known synthetic pathways, and then we can apply it to sanitize ideas that come from other sources that haven't thought much about synthesis. We can take ideas from other generative models that may contain substructures that don't look quite right, or that might not be synthetically accessible according to our definition, and clean them up. We can propose analogs: take a structure that we think might be hard to make and convert it into something we know how to make. This generative approach is trying to ensure that any ideas the models come up with are things we believe we can take into the laboratory. That's the big thesis behind a lot of the work in our group on generative models: making sure that when we get ideas from models, we can actually make them.
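For flavor, here is a hedged sketch of what that guarantee can look like mechanically: accept a proposed route only if every leaf is a purchasable building block and every step is reproduced by a curated reaction template. The stock list, the single template, and the route format are all illustrative inventions, not the group's actual code.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

STOCK = {"CC(=O)O", "CCN"}                     # toy list of purchasable building blocks
TEMPLATES = [AllChem.ReactionFromSmarts(       # curated, rigid transformation rules
    "[C:1](=[O:2])[OH].[NX3;H2,H1:3]>>[C:1](=[O:2])[N:3]")]

def step_is_allowed(reactants, product) -> bool:
    """Does some curated template actually map these reactants to this product?"""
    mols = tuple(Chem.MolFromSmiles(s) for s in reactants)
    target = Chem.CanonSmiles(product)
    for rxn in TEMPLATES:
        for outcome in rxn.RunReactants(mols):
            prod = outcome[0]
            Chem.SanitizeMol(prod)
            if Chem.MolToSmiles(prod) == target:
                return True
    return False

def route_is_synthesizable(route) -> bool:
    # Leaves = molecules consumed but never produced; they must all be in stock.
    leaves = {m for reactants, _ in route for m in reactants} \
             - {product for _, product in route}
    return leaves <= STOCK and all(step_is_allowed(r, p) for r, p in route)

route = [(("CC(=O)O", "CCN"), "CCNC(C)=O")]    # one amide-coupling step
print(route_is_synthesizable(route))           # True: stock leaves + allowed chemistry
```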
1: Yeah, let's go back to the previous slide. In one of the slides, you mentioned that using generative AI you were getting structures that weren't valid, reactions that wouldn't pass an organic chemistry exam. In this case, how are you overcoming that?
2: Yes. Yeah, so in this case, we constrain it. Basically, you can think about a molecule as the result of a synthetic process. The leaf nodes of this tree, the synthesis tree, we constrain to be commercially available starting materials; that's one constraint. The second constraint is through the templates: we say your chemical transformations must obey these rigid rules. They're approximations of chemical reaction rules; there are really no black-and-white rules in chemistry, but we approximate them. We constrain the leaf nodes of the tree, and we constrain how leaf nodes are brought together, and that combination gets us a guarantee that anything coming out of the model is something we know how to make. Yeah. Thank you.

So molecular design offers us a lot of different ways of coming up with structures, in this case constrained by synthesis to make them actionable. The beneficial use case, obviously, is that you make molecules with useful function: therapeutic candidates, candidates for material additives. On the dual-use side: if you think you can invent molecular structures that have property A, you might be concerned that someone will invent molecules that have property B. So there are the basic concerns of making hazardous chemicals, and again this notion of circumventing controlled substances. You can wonder, when it's easy to enumerate billions or trillions of accessible molecules, how we regulate them. The way we regulate structures is typically by defining patterns, so how do you write the pattern for, say, a DEA controlled substance list that covers billions or trillions of potential new molecules?

But there are, again, mitigating factors, reasons not to be too concerned. This is still really useful technology that should be pre-competitive and open source. Partially, there are a lot of known substances, so maybe we're not so concerned about new ones. And I think the most important point is the bottom bullet: when we do molecular design, we're often trying to optimize some target property profile, and this relies on our ability to predict that property profile. If we can't predict these properties, there's no reason to think a generative model will be able to invent a structure that optimizes them. There was a comment article some years ago where the authors had a property prediction model trained to predict the LD50, the lethal dose, of different substances, using public toxicity data from PubChem. They reported predicting thousands of structures more toxic than VX nerve agent, which sounds alarming: thousands of molecules in these chemical databases, relatively accessible, that could have a lower LD50 than VX, a horrible nerve agent. But it's just a model prediction, and I take these model predictions with large grains of salt. When you do this, or predict bioactivity for drug design, you can run into situations where your model is confident but not actually accurate. So there are, again, lots of caveats in how we think about molecular design in the context of both beneficial use and dual use.
The very last set of things I want to talk about relates to how we bring these pieces together: autonomous laboratories that could, in principle, use these ideas of design and synthesis prediction and run their own closed-loop optimization to try to invent new structures. In autonomous platforms, the basic idea is that we need a feedback loop where we go from a hypothesis, to an experiment, to running that experiment, to getting results, and then to revising the hypothesis. That's autonomy: platforms capable of designing, executing, and interpreting their own experiments. You can imagine this in the context of molecules: design a molecule, design a process to make it, test it, and then revise your belief about what its properties should be.

When I think about autonomous platforms, I tend to divide their characteristics along two axes: the capability of the platform, and the agency we give it. If I have some hypothetical platform to do chemistry for me, capability asks: what can it do? What experiments are physically compatible with this platform? Agency asks: what permissions am I giving this device? If it's an automated chemistry lab, what do I let it choose to do? Do I let it choose experiments on its own, or do I, as a human, review everything and supervise the process? The intersection of these is basically what you allow an experimental platform to run on its own, without human intervention. And I think this is the interesting opportunity, but also the potential risk, when we think about the future of chemistry and the way AI is going to change how we conduct molecular discovery.

There are different ways of thinking about automated laboratories and how you build them, and I like to draw analogies to computing. There's a trade-off between flexibility and robustness in computing hardware: a CPU is very general purpose, but maybe less efficient; for particular tasks where GPUs excel, you use those, but GPUs don't do everything CPUs do. And as you go to really application-specific hardware, ASICs for example, you get very efficient and robust, but you've sacrificed flexibility in operations. Chemistry labs are kind of the same. You define the scope of operations you're trying to do, and if it's narrow, for example if you restrict your chemical space to peptides, we're actually really good at automating that: robust solutions exist, and they've been commercial for a long time. But if you want to automate any type of chemical reaction, as general purpose as possible, then you're on the other side of the spectrum, with a system that's probably not going to be very robust, though it may be very flexible. This has influenced how people think about designing these autonomous platforms: again, not just what they're capable of doing, but the agency you give them, which experiments you let them run on their own. At MIT, toward that challenge, when I was there as a student, we had a demonstration of a robotic flow platform, with an arm that could assemble flow chemistry pathways.
There are a number of different examples from industry and elsewhere of autonomous labs capable of running their own experiments. Some are designed for optimization; some are just designed to run linear experiments that a human specifies. But you have these increasingly complex systems capable of running complex workflows and doing their own chemistry, biological testing, analysis, and so on. It's really interesting, in my opinion, to think about how all these developments in AI that we use for designing molecules, designing syntheses, and anticipating properties, the planning side, intersect with the execution side. There's really interesting work being done on the hardware automation side, where you have complex systems able to do the experiments humans do, but potentially at smaller scale and around the clock. It's a really big opportunity to accelerate discovery.

Now, autonomous laboratories have also generated a lot of discussion. If we do have laboratories capable of running their own reactions, should we be nervous about what they're able to run? Obviously there are beneficial uses, like accelerating the time to test molecules as therapeutic candidates, but there are dual-use considerations too; there are articles about this, and discussion in the media. You can imagine intentional misuse, but there could also be accidental misuse: it's kind of easy to accidentally mix things that are incompatible and end up with a safety hazard. You might need some sort of oversight, especially if you're using tools like LLMs to increase accessibility, because then perhaps you have operators who don't have the intuition to know what's incompatible and what might lead to adverse events. Thankfully, I think we're at a good point, because there are a lot of simple mitigating factors that improve the safety of these platforms. You can simply say humans have to approve experiments. You can limit the chemicals you give the platform access to. So there are a lot of operational safeguards that help alleviate these concerns, but with the growing robustness of automation and the growing robustness of AI planning, these are worthwhile considerations to keep having.

I'm not going to go through the full summary; I'll flip back to it after the acknowledgements. But on the last bullet point, number seven, I wanted to mention one consideration that's worth thinking about as we talk about the role of AI for beneficial science, and as it affects things more broadly: the access models for these tools and this software. There's a big functional difference between open-access software and open-source software in terms of who controls it, who can use it, what expertise is required to use it, and what it costs. This has come up in many of our conversations with governmental bodies, even around synthesis planning: what does it mean to give someone the source code versus a website they can use, and how should that influence how we develop this technology further?

But I want to leave time for discussion, so I'll end there by thanking the fantastic group members who have actually done this work, a fantastic set of students and postdocs. And for our software output,
we also have a few talented software developers really helping us professionalize and polish our tools. Of course, funding support has made all of this possible as well, partly federal, but also from generous pharmaceutical and chemical companies. One important funding agency I'll give a shout-out to for our work in the chemistry space is the NSF Center for Computer Assisted Synthesis, which has been a fantastic way to bring together folks from chemistry and computer science, and to help advance more and more of these tools while having these discussions about responsible AI as well. So thank you again, and thanks for your attention.