Integrating ML in experimental pipelines


This talk will focus on the implementation of ML models in actual experimental pipelines. We will review strategies for sharing pre-trained models that can be readily adopted by non-expert users, and how to bridge the gap between dry-lab and wet-lab researchers, with case studies in the field of biomedicine. The interactive tutorial will use one such open-source hub of pretrained models, the Ersilia Model Hub.

Slides

Find the slides here.

Transcript

Thank you, everyone. I'm going to keep my presentation a bit short to make sure we have some time for discussion at the end. I'm bringing us back a bit to what Mike was saying. We thought it would be nice to close out this workshop with a bit more on real applications of what we are developing to actual scientific problems: how can people who are not programmers use our models once they have been well developed, maintained, tested, evaluated, etcetera?
I'm Gemma, the co-founder of a nonprofit organization called Ersilia Open Source Initiative. My background is in molecular biology, and only after I completed my PhD studies did I dedicate myself to computer science, with the goal of making all these tools that you see published every day, and all these cool things happening in AI and machine learning, usable by experimental scientists, concretely in biomedical research.
If we think about the life of a machine learning model, you start by creating the data, training your model, and evaluating it; hopefully you also do benchmarking for your field. There are a number of databases that you can use to benchmark your models. For example, in my particular area, which is drug discovery, we use Therapeutics Data Commons or MoleculeNet. And then finally, you end up sharing your model. And here the question is: does sharing your model mean simply dumping your code and your checkpoints in a repository? Or do you maybe go further and look at what impact the model can have on your research field?
If you work in industry, there are usually a number of solutions, or you have people who make sure that users get the outputs of the model. When you are working in more academic settings, that is very difficult. You publish your papers and basically nothing else happens, or not much else. And we have some numbers. I just pulled these quickly from the literature; if you look, there are many more.
In one study earlier this year, from the Harvard Dataverse, where they keep all the code developed at Harvard, or most of it, they found that 74% of R files in this repository failed to complete without errors on the first execution. So almost all of them are either not maintained, or were not designed appropriately, or were not really tested, and now they don't work anymore. And if we think about omics tools applied to biomedical research, and I'm referring to all this analysis of big data on proteins, genomes, etcetera, many times the scientists or developers offer web links where you can actually interact with these tools, which is great. But many of them stop working after not that long. So these things are not being maintained, the code is lost somewhere, and no one is actually using it.
So what can we do now? What happens when you're in front of someone like this, which is actually me a few years ago? I don't know how to program. I have no knowledge of environments, conda, managing packages, etcetera. And I don't have easy access to machine learning developers, because they are not at hand everywhere. But the potential to save time and money, and to really apply the outputs of machine learning to my research, is quite high, and it could have a real impact.
So what can we do in this scenario? I'm going to talk you through a concrete example, which is the Ersilia Model Hub, the tool that we are building at Ersilia. Just to close off after almost two hours of listening to more technical talks, let me take you through a bit more of a mission-driven talk. Our goal is to provide ready-to-use machine learning for scientists working in drug discovery. Basically, all the things that happen inside building a machine learning model, all the steps that we have just gone through in the rest of the talks, usually stop here, when you have validated your model, hopefully not only with your test data but also with external data, and you have made sure that it can be applied to different types of datasets, etcetera, as was explained earlier. So you get here. So now what?
In many cases, this is where machine learning development ends, and what we are trying to do at Ersilia is to add this extra layer, which we call the deployment layer, so that the scientist, or someone who is not a machine learning developer, doesn't need to see anything that is going on inside here, but can simply come with a question, for example about a new molecule, select the model relevant to his or her research, input this molecule, and directly get the output. So they can use and apply these machine learning insights without actually having to understand what is going on inside. This requires, of course, that the developers have done all the work that we have been discussing, making sure that the models are working and that they don't have any biases.
So why do we do this? This is the world; you're all familiar with it. You are probably based in very different countries; I'm currently here in Spain. What happens if we plot this world according to the communicable disease burden? We see that there is a huge imbalance: most of the countries suffering from communicable diseases are low and lower-middle income countries. And if we plot, on the contrary, the research output of these countries, the map is completely reversed. Those countries are the ones producing less research, which leads directly to a big imbalance. These countries produce very little research, and they are very affected by communicable, infectious diseases. But these diseases are not researched very much, because they are not affecting the people who are actually doing most of the research. Only 50% of the drugs being developed in the world, and this number increased after COVID, I can assure you, are actually targeting infections.
So this is a way to put context to all these machine learning models, to why we do all this work, and why it's important to actually take on all these steps and make sure that your models work: because eventually you can apply them to real case scenarios, to real situations where you can help improve someone's life and find that new drug or, in the case of Mike, identify new galaxies and stars and advance astronomy. So basically machine learning can bridge the gap in all these fields that are resource constrained and not really researched by many institutions in the Global North.
Just to understand what I will be talking about: what can we do if we have a new drug candidate, for example? Say I'm a researcher working in malaria. I want to know if this new drug candidate will have antimalarial potential and whether it can be synthesized in the lab. I can even predict the price it will have; whether it will be soluble in water, meaning that you can take it orally and don't have to inject it, which is very important for making sure that it can be used; that it won't cause important secondary effects, such as cardiac arrest; that it is not toxic in general; etcetera. So you have all these questions, but you don't really have the money to test them in the lab, because these are very expensive experiments.
So basically, you can use this whole bunch of models. If you get nice activities in these models (these are predicted active, these ones are predicted inactive), you can think that you are in front of a good drug candidate. But if you are in the contrary situation, where things like toxicity and secondary effects are high, the activity against your target is low, the molecule is difficult to synthesize in the lab, etcetera, so that the models give you low scores or classify it as non-active, you would think that you are in front of a bad drug candidate.
So that's the type of work that we do at Ersilia. We try to provide all these models to make sure that scientists can actually use these predictions for their ongoing research, and they don't have to do tons of very expensive experiments, for which many don't have the resources. So this is our hub. You can browse it; I'll drop the links later. But in the five minutes that I have left in the talk, I just want to explain a bit the journey that we made to arrive at this, to make sure that these models are deployed and usable.
We take two types of models. On one hand, we take models that have been developed by third parties, that is, any of you: you develop your models, you publish them in repositories. Data is not always available for these models. These models have been developed in a diversity of environments. As we were seeing at the beginning of the talk with the requirements file, maybe you are not specifying your package versions, and then the models start breaking every time the packages are updated. So you have all these different characteristics, because these models may be more or less maintained, but we do the work to take them and make sure that these little things are improved so that they can be used by other people.
On the other hand, we also develop our own models. To develop your own models you really need to take into account all these pipelines that we've been discussing: cleaning the data, making sure it's standardized, and making sure that the columns and labels of your data are not introducing any biases. We rely heavily on automated machine learning solutions to be able to have a higher throughput.
We are currently implementing, very importantly, interpretability scores, so that the scientists who are going to use these models can have a sense of how well these models are working, and, very importantly as well, benchmarking of new models, so that we make sure that what we are doing is actually the state of the art, or as close to the state of the art as possible.
So this is the Ersilia Model Hub, which is a platform of pretrained, ready-to-use machine learning models for drug discovery. How is it orchestrated? Each model is uploaded as an independent GitHub repository. What the software does is simply go and fetch the individual model, create a conda environment or use Docker, depending on the model, run the model inside its own conda environment, close it, and that's it. And we have a backend of all these models on Airtable. So we simply have all these different modules.
The problems that we have found: packages get updated, and for many packages the versions are not specified, so we need to constantly keep checking the models and make sure that they are still working. Right now, you as a user need to download the models onto your own computer. And currently we only have deployment through a command-line interface, which is not very usable by many people; actually, very few people are comfortable using a command-line interface. And of course, something that many of you may have encountered: we cannot work on Windows machines. Our system only works on Mac and Linux. So at some point it feels more like an obstacle course than actually developing models. You're just trying to solve this infinite number of problems. So, just very quickly, here is how we solved some of them, and hopefully this inspires some of you when you're doing your work and thinking of how you can apply it to your own problems.
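
The unpinned-version problem described above is the kind of thing a small automated check can catch before a model is shared. The following is a minimal sketch and not Ersilia's actual tooling: the helper name `find_unpinned` and the example requirements text are hypothetical, and the sketch assumes that only an exact `==` pin counts as a "specified" version.

```python
import re
from typing import List

# Hypothetical helper (not part of the Ersilia codebase): report requirement
# lines that do not pin an exact version with "==".
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def find_unpinned(requirements_text: str) -> List[str]:
    """Return the package names whose requirement line lacks an exact '==' pin."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop inline comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            # Skip blank lines and pip options such as "-r other.txt".
            continue
        match = _NAME.match(line)
        if match is None:
            continue
        if "==" not in line:
            unpinned.append(match.group(0))
    return unpinned


# Made-up file content, for illustration only.
EXAMPLE = """\
numpy==1.24.4
pandas>=1.5      # lower bound only, so a future release may break the model
scikit-learn
# a full-line comment
rdkit==2023.3.2
"""

if __name__ == "__main__":
    print(find_unpinned(EXAMPLE))
```

Run on the made-up example, the check flags `pandas` and `scikit-learn`, the two entries that a later package release could silently change.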
Of course you need to incorporate continuous testing, to make sure that the models don't just break and stop working at some point. We do this by using GitHub Actions because, as I said, all our models are on GitHub, and GitHub Actions is very, very powerful. Basically, we have automatic triggers that keep testing the models randomly and make sure that they are working. We developed all of this with the support of, and I need to mention them, amazing volunteers from GitHub itself. So I encourage you, if you develop a model, to try to have some way of making sure that it won't break after a couple of years, that it will continue working.
And then for the rest we are testing different solutions. Of course we aim at having a fully online prediction tool, but that's costly and also a lot of effort from Ersilia's side. We're a very small team, a nonprofit organization, so we don't really have the capacity at this moment. So what we said is: okay, we need to have some kind of cloud-based system that uses a minimal setup. Mike already mentioned it, but the solution that we found, as a first step, is to use Colab notebooks. There are other options that I think Valerie pointed me to, like MyBinder, which also deploys Jupyter notebooks.
I wanted to have kind of a more interactive session. This one is simple, but let me just show it quickly. You can go to our repository; everything is open source. We have this template, which, if you click on it, you can open in Colaboratory. You need a Google account to do this, by the way, and we use it linked to Google Drive, so it is only for people with Google accounts at this moment. But basically, you can even use these functions where you hide all the code so people don't get scared and say, "What is this?" Of course you can still see what is going on inside, but you just need to keep clicking through the steps: you set your input folder, your output folder, etcetera, and fetch the model. So you go to our database, check the model that you want, then you need to fetch and serve it, and simply by clicking here you run the prediction and get your results, and by clicking here they are automatically saved in your Drive.
So this is a very, very easy way, the minimum that you can do as a non-expert to actually run a machine learning model, short of a complete web application. This kind of deployment doesn't take a lot of time: it's just writing a notebook and, instead of leaving all the code there, cleaning it a bit and making it less scary for a non-expert. It can really increase the usability and visibility of our tools, and then also bring more citations, more real impact, etcetera.
So, to close up, and I'll close in two minutes so we have time for discussion, I just wanted to end on a very positive note about how machine learning, when it's well applied, is advancing real-world studies, in my case in drug discovery; I'm sure Mike will have many other great examples. Basically, we collaborate with organizations based in low and middle income countries, and we provide support to research projects that are already ongoing. There is this small center called the H3D Centre in South Africa, where they are working on malaria and tuberculosis. They have over 7000 molecules for which they have experimental data. That's actually a big dataset in scientific terms, though not in machine learning terms, but they don't have anyone who is actually able to do data science. And it's difficult to recruit this kind of expertise in academia, and even more so in low-income countries. So we developed this automated machine learning pipeline and benchmarked it on Therapeutics Data Commons. It's not online yet, so you'll need to wait for the benchmark to be online.
And basically, we generated a virtual screening cascade, where each one of the experiments they do in the laboratory is translated into a machine learning model. So it looks something like this. I'm not going to go into the scientific details, but they have all these different kinds of models, each more advanced in the drug discovery cascade: you have all your molecules, you test them, if they are good you progress them, and so on, to get better and better drugs. So now scientists can simply come with their molecule and run these models before they actually do an experiment, and just discard the molecules that are not going to be good, or probably not. And this gives the scientists really a lot of savings in terms of time and money.
Maybe, as we saw today, areas under the curve may not be the best way to show the models, but this is just a summary of the models that we have developed. They all have quite good performance, and these are all automated, out of the box; we didn't do anything specific to each dataset. So for an automated model, this is actually quite good performance. And because it's automated and we have trained people there on how to use it, they can update the models when new data comes in, which is also very interesting for them.
And the second case example, very quickly: we can then couple these models to the new fancy generative models. The same way that there are all these generative language models, there are also models that can be used to generate new molecules with the activity you want, in this case against malaria, in collaboration with the Open Source Malaria consortium. We generated a number of candidates, many more than we see here, but anyway, a large number. The colors go from less active to more potentially active against the malaria parasite, and we actually synthesized and tested a few of them in the laboratory.
And those with green numbers, here, here, and here, are active: they are really killing malaria parasites in vitro, in the laboratory. This is an excellent hit rate, because typically virtual screening cascades on new libraries have a hit rate of 1 or 2%. Here we took eight molecules and four of them are good, so that's a 50% hit rate.
So I rushed through a bit, but I wanted to give enough time for discussion. The core message of my talk is that there is a lot of potential for machine learning to have impact in real-world scenarios, provided the machine learning is well developed, well applied, maintained, and easy to use by non-experts. And yeah, if you're interested in our small initiative, you can check out the website or just write to me directly. Thank you.
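
The virtual screening cascade from the first case study can be pictured as a chain of filters, each standing in for one laboratory assay. What follows is a toy sketch under loud assumptions: the stage names, thresholds, scores, and molecule identifiers are all invented for illustration, and the dictionary lookups are placeholders for trained predictors, not models from the Ersilia Model Hub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One step of the cascade: a predictor plus the rule for passing it."""
    name: str
    predict: Callable[[str], float]  # molecule identifier -> score in [0, 1]
    threshold: float
    higher_is_better: bool = True

    def passes(self, molecule: str) -> bool:
        score = self.predict(molecule)
        if self.higher_is_better:
            return score >= self.threshold
        return score <= self.threshold


def run_cascade(
    molecules: List[str], stages: List[Stage]
) -> Tuple[List[str], Dict[str, str]]:
    """Push molecules through the stages in order.

    Returns the molecules that survive every stage and, for the others,
    the name of the first stage that discarded them.
    """
    survivors = list(molecules)
    discarded_at: Dict[str, str] = {}
    for stage in stages:
        still_in = []
        for molecule in survivors:
            if stage.passes(molecule):
                still_in.append(molecule)
            else:
                discarded_at[molecule] = stage.name
        survivors = still_in
    return survivors, discarded_at


# Invented scores standing in for model predictions (illustration only).
FAKE_ACTIVITY = {"mol-A": 0.9, "mol-B": 0.8, "mol-C": 0.2, "mol-D": 0.7}
FAKE_TOXICITY = {"mol-A": 0.1, "mol-B": 0.6, "mol-C": 0.1, "mol-D": 0.2}
FAKE_SOLUBILITY = {"mol-A": 0.8, "mol-B": 0.9, "mol-C": 0.9, "mol-D": 0.3}

STAGES = [
    Stage("antimalarial activity", FAKE_ACTIVITY.__getitem__, 0.5),
    Stage("toxicity", FAKE_TOXICITY.__getitem__, 0.5, higher_is_better=False),
    Stage("aqueous solubility", FAKE_SOLUBILITY.__getitem__, 0.5),
]

if __name__ == "__main__":
    kept, dropped = run_cascade(list(FAKE_ACTIVITY), STAGES)
    print("progress to the lab:", kept)
    print("discarded:", dropped)
```

In this sketch only the molecules that survive every stage would be sent for synthesis and testing, which is where the savings in time and money mentioned in the talk would come from.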