Integrating ML in experimental pipelines

This talk will focus on the implementation of ML models in actual experimental pipelines. We will review strategies for sharing pre-trained models so that they can be readily adopted by non-expert users, and how to bridge the gap between dry-lab and wet-lab researchers, with case studies in the field of biomedicine. The interactive tutorial will use one such pretrained open-source model repository, the Ersilia Model Hub.


Find the slides here.


Thank you, everyone. Cool. So I'm going to keep my presentation a bit short to make sure we have some time for discussion at the end. I'm bringing us back a bit to what Mike was saying. We thought it would be nice to close out this workshop with a bit more on real applications of what we are developing to actual scientific problems, or how people who are not programmers can use our models once they have been well developed, maintained, tested, evaluated, etc. So, I'm Gemma, I'm the co-founder of a nonprofit organization called the Ersilia Open Source Initiative. My background is in molecular biology, and only after I completed my PhD studies did I dedicate myself to computer science, with the goal of making all these tools that you see published every day, and all these cool things happening in AI and machine learning, usable by experimental scientists, concretely in biomedical research. If we think about the life of a machine learning model, you start by creating the data, training your model, evaluating it, and hopefully benchmarking it for your field. There are a number of databases that you can use to benchmark your models; for example, in my particular area, which is drug discovery, we use Therapeutics Data Commons or MoleculeNet. And then finally, you end up sharing your model. And here the question is: does sharing your model mean simply dumping your code and your checkpoints in a repository? Do you maybe go further? Do you ask what impact the model can have on your research field? If you work in industry, there is usually a number of solutions, or you have people who actually make sure that others get the outputs of this model. When you are working in more academic settings, that is very difficult. You publish your paper and basically not much else happens. And we have some numbers; I just pulled these quickly from the literature, and there are many more.
In one study from earlier this year on the Harvard Dataverse, where they keep all the code developed at Harvard, or most of it, they found that 74% of R files in the repository failed to complete without errors on the first execution. So almost all of them are not maintained, or they were not designed appropriately, or they were not really tested, and now they don't work anymore. And if we think about omics tools applied to biomedical research, by which I mean all this analysis of big data on proteins, genomes, etc., many times the scientists or the developers offer web links where you can actually interact with these tools, which is great. But many of them stop working after not that long. So these things are not being maintained, the code is lost somewhere, and no one is actually using it. So what can we do? What happens when you're in front of someone like this, which is actually me a few years ago, who doesn't know how to program? I have no knowledge of environments, conda, managing packages, etc. And I don't have easy access to machine learning developers, because they are not readily at hand everywhere. But the potential that I can save time and money, that I can really apply the outputs of machine learning to my research, is quite high, and it could have a real impact. So what can we do in this scenario? I'm going to talk you through a concrete example, which is the Ersilia Model Hub, the tool that we are building at Ersilia. After almost two hours listening to more technical talks, let me take you through a bit more of a mission-driven talk. Our goal is to provide ready-to-use machine learning for scientists working in drug discovery. Basically, all the things that happen inside building a machine learning model, all the steps that we have just gone through in the rest of the talks.
These usually stop once you have validated your model, hopefully not only with your test data but also with external data, making sure that it can be applied to different types of datasets, etc., as was explained earlier. So you get here, and now what? In many cases this is where machine learning development ends, and what we are trying to do at Ersilia is to add an extra layer, which we call the deployment layer, so that a scientist, someone who is not a machine learning developer, doesn't need to see anything that is going on inside, but can simply come with a question, for example about a molecule, select the model relevant to his or her research, input this molecule and directly get the output. So they can use and apply these machine learning model insights without actually having to understand what is going on inside. Which requires, of course, that the developers have done all the work we have been discussing, making sure that the models are working and that they don't have any biases. So why do we do this? This is the world; you're all familiar with it; you are probably based in very different countries, and currently we are here in Spain. If we plot this world according to the communicable disease burden, we'll see a huge imbalance: most of the countries suffering from communicable diseases are low and lower-middle income countries. And if we plot the research output of each country, the map is completely reversed. Those countries are the ones producing the least research, which leads directly to a big imbalance. These countries produce very little research and are very affected by communicable, by infectious diseases. But basically these diseases are not much researched, because they are not affecting the people who are actually doing most of the research. Only 50% of drugs that are being developed in the world,
and this number increased after COVID, I can assure you, are actually targeting infections. So this is a way to put context on all these machine learning models, on why we do all this work and why it's important to actually take all these steps and make sure that your models work: because eventually you can really apply them to real-case scenarios, to real situations where you can help improve someone's life, find that new drug, or, in the case of Mike, identify new galaxies and stars and advance astronomy. Basically, machine learning can bridge the gap in all these fields that are resource-constrained and not really researched by many institutions in the Global North. So, just to understand what I will be talking about: what can we do if we have a new drug candidate? Say, for example, I'm a researcher working in malaria. I want to know if this new drug candidate will have anti-malarial potential and whether it will be possible to synthesize it in the lab. I can even predict the price it will have, or whether it will be soluble in water, meaning that you can take it orally and don't have to inject it, which is very important for making sure that it can be used; or that it won't cause important secondary effects, or just cardiac arrest; that it is not toxic in general, etc., etc. So you have all these questions, but you don't really have money to test them in the lab, because these are very expensive experiments. Basically, you can use this whole set of models, and if you hopefully get nice activities in these models, so these are predicted active, these ones are predicted inactive, you can think that you are in front of a good drug candidate. But if you are in the contrary situation, where these kinds of activities, like toxicity, secondary effects, etc., are high, your target is difficult to synthesize in the lab, et cetera, et cetera, and the models give you low scores or classify it as non-active,
then you conclude that you are in front of a bad drug candidate. So that's the type of work that we do at Ersilia. We try to provide all these models to make sure that scientists can actually use these predictions for their ongoing research, so they don't have to do tons of very expensive experiments, for which many don't have the resources. So this is our hub. You can browse it; I'll drop the links later. But just in the few minutes that I have left in the talk, I want to explain a bit the journey that we made to arrive here, to make sure that these models are deployed and usable. We take two types of models. On one hand, you can take models that have been developed by third parties, the same way that you develop your models and publish them in repositories. Data is often not available for these models. These models have been developed in a diversity of environments; as we were seeing at the beginning of the talk with the requirements file, maybe you are not pinning your packages, and then the models start breaking every time the packages are updated. So you have all these different characteristics, because these models may be more or less maintained, but we do the work to actually take them and make sure that these little things are improved so that they can be used by other people. On the other hand, we also develop our own models. To develop your own models you really need to take into account all these pipelines that we've been discussing: cleaning the data, making sure it's standardized, making sure that the columns and the labels of your data are not introducing any biases, and relying heavily on automated machine learning solutions to be able to have a higher throughput.
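To make the earlier drug-candidate example concrete, here is a minimal sketch of combining predictions from a panel of models into a go/no-go signal. The assay names, scores, and the 0.5 cutoff are all hypothetical illustrations, not Ersilia's actual interface.

```python
# Hypothetical sketch: aggregate per-assay predicted probabilities for one
# candidate molecule into a simple promising / not-promising decision.

def screen_candidate(scores, threshold=0.5):
    """scores maps assay name -> predicted probability of the desired
    outcome (e.g. anti-malarial activity, solubility, non-toxicity)."""
    failed = [assay for assay, p in scores.items() if p < threshold]
    return {"promising": not failed, "failed_assays": failed}

candidate = {
    "antimalarial_activity": 0.91,
    "aqueous_solubility": 0.78,
    "non_cardiotoxic": 0.88,
    "synthetic_accessibility": 0.35,  # predicted hard to make in the lab
}
result = screen_candidate(candidate)
print(result)  # flags synthetic_accessibility as the failing assay
```

In practice each score would come from a separate pretrained model, but the aggregation logic a non-expert interacts with can stay this simple.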
We are currently implementing, very importantly, interpretability scores, to make sure that the scientists who are going to use these models can have a sense of how well the models are working, and, also very importantly, benchmarking new models so that we make sure that what we are doing is the state of the art, or as close to the state of the art as possible. So this is the Ersilia Model Hub, a platform of pretrained, ready-to-use machine learning models for drug discovery. How is it orchestrated? Each model is uploaded as an independent GitHub repository, and what the software does is simply go and fetch an individual model, create a conda environment or a Docker container depending on the model, run the model inside its own environment, close it, and that's it. And we have a backend listing all these models on an Airtable. So we simply have all these different modules. The problems that we have found: packages get updated, and in many packages the versions are not pinned, so we need to constantly keep checking the models and make sure that they are still working. Right now, as a user, you need to download the models to your own computer, and currently we only offer deployment through a command-line interface, which is not very usable for many people; actually, very few people are comfortable using a command-line interface. And, as many of you may have encountered, we cannot work on Windows machines; our system only works on Mac and Linux. So at some point it feels more like an obstacle course than actually developing models; you're just trying to solve this infinite number of problems. So, very quickly, here is how we solved some of them, and hopefully this inspires some of you when you're doing your work and thinking about how to apply it to your own problems.
First, of course, you need to incorporate continuous testing, to make sure that the models don't just break and stop working at some point. We do this using GitHub Actions because, as I said, all our models are on GitHub, and GitHub Actions is very powerful. Basically, we have automatic triggers that keep testing the models randomly and make sure that they are working. We developed all this with the support of, I need to mention them, amazing volunteers from GitHub itself. So I encourage you, if you develop a model, to have some way of making sure that it won't break after a couple of years, that it will continue working. And for the rest, we are testing different solutions. Of course, we aim at having a fully online prediction tool, but that's costly and also a lot of effort from Ersilia's side; we're a very small team, a nonprofit organization, so we don't really have the capacity at this moment. So what we said is, okay, we need some kind of cloud-based system that requires a minimal setup. Mike already mentioned it, but the solution that we found, as a first step, is to use Colab notebooks. There are other options; I think Valerie pointed me to some of them, like MyBinder, which also deploys Jupyter notebooks. But I wanted to have a kind of more interactive session. This one is simple, but let me just show it quickly. You can go to our repository; everything is open source. We have this template which, if you click on it, you can open in Colab. You need a Google account to do this, by the way, and it's linked to Google Drive, so it's exactly for people with Google accounts at this moment. But basically, you can even use these functions where you hide all the code so people don't get scared, so they don't say "what is this?" Of course you can actually see what is going on inside, but you just need to keep clicking through the steps.
You set up your input folder, your output folder, etc., and fetch the model. So you go to our database, check the model that you want, then you need to fetch and serve it, and simply by clicking here you will run the prediction and get your results, and by clicking here they will be automatically saved in your Drive. So this is a very, very easy way, the minimum that you can do for a non-expert to actually run a machine learning model, short of a complete web application. This kind of deployment doesn't take a lot of time beyond just writing a notebook; instead of leaving all the code there, cleaning it a bit and making it less scary for a non-programmer can really increase the usability and visibility of our tools, and then also more citations, more real impact, etc., etc. So, to close up, and I'll close up in two minutes so we have time for discussion, I just wanted to end on a very positive note about how machine learning, when it's well applied, is advancing real-world studies, in my case applied to drug discovery; I'm sure Mike will have many other great examples. Basically, we collaborate with organizations based in low and middle income countries, and we provide support to research projects that are already ongoing. There is this small centre, the H3D Centre in South Africa, where they are working on malaria and tuberculosis. They have over 7,000 molecules for which they have experimental data; that's actually a big dataset in scientific terms, though not in machine learning terms, but they don't have anyone who is able to do data science. And it's difficult to recruit this kind of expertise in academia, and even more so in low-income countries. So we developed this automated machine learning pipeline, and we benchmarked it on Therapeutics Data Commons; it's not online yet, so you'll need to wait for the benchmark to be online.
And basically, we generated a virtual screening cascade, where each one of the experiments they do in the laboratory is translated into a machine learning model. So it looks something like this. They have all these different, I'm not going to go into the scientific details, but all these different kinds of models, more and more advanced along the drug discovery cascade: you have all your molecules, you test them, if they are good you progress them, if not you don't, to get better and better drugs. So now scientists can simply come with their molecule and run these models before they actually do an experiment, and just discard the molecules that are not going to be good, or probably not. And this really gives a lot of savings in terms of time and money to the scientists. Maybe, as we saw today, areas under the curve may not be the best way to show the models, but this is just a summary of the models that we have developed. They all have quite good performances, and these are all automated, out of the box, so we didn't do anything specific to each dataset; for an automated model that is actually quite a good performance. And because it's automated, and we have trained people there on how to use it, they can update the models when new data comes in, which is also very interesting for them. And a second case example, very quickly: we can then couple these models to fancy new generative models. The same way that there are all these generative language models, there are also models that can be used to generate new molecules with the activity you want, in this case anti-malarial, in collaboration with the Open Source Malaria consortium. We generated a number of candidates, many more than we see here, but anyway, a large number; the colors go from less active to more predicted-active against the malaria parasite, and we actually synthesized and tested a few of them in the laboratory.
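The cascade logic described above, where molecules only progress to the next experiment if the previous model predicts activity, can be sketched in a few lines. The stage names, scores, and cutoffs below are illustrative assumptions, not the actual H3D/Ersilia pipeline.

```python
# Hypothetical sketch of a virtual screening cascade: each lab experiment is
# replaced by a predictive model, and molecules only advance past a stage if
# the model's score clears that stage's cutoff.

def cascade(molecules, stages):
    """stages is an ordered list of (name, predict_fn, cutoff);
    returns the molecules surviving every stage."""
    surviving = list(molecules)
    for name, predict, cutoff in stages:
        surviving = [m for m in surviving if predict(m) >= cutoff]
        print(f"{name}: {len(surviving)} molecules remain")
    return surviving

# Toy predictors keyed on molecule ids, standing in for trained models.
whole_cell = {"m1": 0.9, "m2": 0.8, "m3": 0.2}.get
toxicity_ok = {"m1": 0.7, "m2": 0.3, "m3": 0.9}.get

hits = cascade(["m1", "m2", "m3"], [
    ("whole-cell activity", whole_cell, 0.5),
    ("low toxicity", toxicity_ok, 0.5),
])
print(hits)  # only m1 clears both stages
```

The saving comes from ordering the stages so that cheap predictions filter out most molecules before the expensive experiments.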
And those with green numbers here are actually active; they really are killing malaria parasites in vitro, in the laboratory. This is an excellent hit rate, because typically virtual screening cascades on new libraries have a hit rate of one or two percent. Here we took eight molecules and four of them are good, so that's a 50% hit rate. So I rushed through a bit, but I wanted to give enough time for discussion. The core message of my talk is that there is a lot of potential for machine learning impact in real-world scenarios, when machine learning is well developed, well applied, and we make sure that it's maintained and easy to use by non-experts. And yeah, if you're interested in our small initiative, you can check out the website or just write to me directly. Thank you.

Testing in Machine Learning

What is testing ML, and how is it different from testing deterministic code? Why it's important to test ML artifacts (data + models). What testing data and testing models looks like (with quick code snippets so people can see what it looks like). Concluding thoughts on how testing relates to monitoring and continual learning.


I'll give a quick background; recording in progress. There we go. Awesome. All right. Good to meet everyone. Sorry if it gets a little dark here, the sun isn't up here yet. So today we're already talking about testing. First, my name is Goku Mohandas and, as was just mentioned, I'm the founder of Made With ML. Really quick about my background: it's in biology slash chemical engineering, so I used to be a scientist like Mike, and then I transitioned to more of the applied side. I actually started my career building a rideshare analytics app when Uber and Lyft first started out, to help the taxi sector compete against them, then went on to work at Apple, mostly doing NLP and building out their initial ML platforms, to kind of standardize that. I was there about three years and then transitioned back to health: I left with the head of Apple Health to start a company in the oncology space, and we were there for about two years. After the acquisition of that company, I had encountered two different contexts of applying machine learning in production, and I just wanted to share what that looked like. So that was how Made With ML started. It's a completely free, open-source resource for people to learn how to put machine learning into production, and I cover all the topics from first principles, so it's not catered to just my specific contexts. And I've been getting to meet a lot of amazing people who are developing in contexts that I'll personally never touch, Jesper being one of them, so it's become a great kind of medium for myself as well. I'm going to try to keep it short so we can get back on time. I will be talking about testing, and instead of making slides, I thought we could just quickly go through the actual content on Made With ML. So let me get that for you; everyone can see my screen.

Okay, so go to the site, or just search for "testing ML". I'm going to cover the high-level details here, and I'll paste the links at the end as well. I think we're going to have a discussion at the end, so any questions, let's push them to that, and I'll answer them in the chat as well. When people talk about testing machine learning systems, there's a lot of overlap with the testing that we've had with software 1.0, which I'll call deterministic software, and there's a lot we can take from that. But there are also a lot of extensions that we can't directly take from that world, mostly because of these other components: software 1.0 has code, but now we also have these probabilistic artifacts centered around data and the models themselves. So there are some changes we need to make when we think about how we should test these different artifacts. So I always start out with kind of an intuition about the different types of tests.

By the way, all the lessons follow a full end-to-end course here, but obviously some people just come to learn about a specific piece of content, so most of the lessons have a standalone repository and notebook that you can explore as well. I think it's better to see everything in the context of the larger project, though; it just fills in more of the missing pieces. I'm going to skip through some of the intro stuff here; you can go check out the lesson, but we talk about the different types of tests. This directly correlates to the machine learning world as well: how you should be testing, basically setting up the things that you want to test, putting them into the different components that you are actually testing, and then asserting that the outputs are what you're expecting. And there are so many different tools out there; these are just for Python, because we're using Python in the course, but I think every single major language has testing utilities. I'm going to quickly gloss over the best practices; we're actually going to apply each of these as I talk about the code, data and models. But there are, I feel, industry best practices when it comes to how you should set up your testing. Start by testing the smallest unit of code first: it's really good to create small functions with single responsibilities and test those individual pieces, but also test as you start to combine different functions and classes together, so you are testing at all the different layers. And keep your tests up to date, look at coverage, and things like that. So we'll look at quick examples. First, obviously, testing the code itself: even though this is machine learning with probabilistic components, those probabilistic components come out of deterministic code that we're writing. We use pytest for the course here.
It all starts with the major principle of having assertions about how your piece of code should run. I always start off with simple examples, and then in the actual course you can see more of what it looks like in the context of machine learning. So let's say you have a function for calculating certain metrics. You probably want to assert that the custom function you wrote actually behaves the way that you expect. If you're using something like scikit-learn, for example, to calculate the F1 score, those functions have all been tested on their back end, so you can reliably use them. When you're writing your own functions, say for custom metrics or anything custom that you're doing, you're going to want to do the same, so that, one, you can trust it, and two, hopefully you're building something reusable for other people on your team to build other applications on top of. If you can have a central repo, if you will, of well-tested components, people can quickly build their applications knowing that they can trust the pieces of code that are written, at different granularities. To run these tests (I'm going to skip through the code part pretty quickly too, so we can get to the data and models), you can similarly test not just functions but classes as well; there are very efficient ways to set up initial values for the classes so you aren't setting them over and over. The big components I wanted to cover for testing code, which extend to testing the ML artifacts as well, are around parametrization. You don't want to keep writing the testing logic over and over, so you just specify a bunch of inputs and expected outputs, write the logic once, and it will iterate over the different inputs and outputs that you want to test.
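As a sketch of both ideas (plain assertions on a custom metric, and parametrizing a test over many input/output pairs), here is a self-contained pytest-style example. The `f1_score` and `clean_text` functions and the test values are illustrative, not the course's actual code.

```python
import pytest

def f1_score(y_true, y_pred):
    """Custom binary-classification metric we want to be able to trust."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def test_f1_score():
    # 2 TP, 1 FP, 1 FN -> precision = recall = 2/3, so f1 = 2/3
    assert abs(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]) - 2 / 3) < 1e-9

def clean_text(text):
    """Toy preprocessing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

# Write the test logic once; pytest repeats it per (raw, expected) pair.
@pytest.mark.parametrize(
    "raw, expected",
    [("Hello  World", "hello world"),
     ("  MACHINE learning ", "machine learning"),
     ("", "")],
)
def test_clean_text(raw, expected):
    assert clean_text(raw) == expected
```

Running `pytest` in the repository would discover and execute both tests automatically, one invocation of `test_clean_text` per parametrized pair.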
So here's the example for the machine learning use case. Let's say you have a preprocessing function. I don't want to keep writing "this preprocessed input equals this output"; I'll just write down a bunch of inputs and outputs and let pytest parametrize over them. Let's keep going. There's obviously a lot more small detail in how to do this really well. There's a concept called coverage, which I think we should definitely talk about. People always strive for 100% coverage, and that doesn't always mean testing every single line in your repository; it just means that you've accounted for every single line, and there are some lines that just don't make sense to test. When you run, let's say, a coverage report on top of pytest, you don't have to have tested every piece, but you should know what you're not explicitly testing, and usually those are things like setting variables. So there are ways you can have 100% coverage while knowing that you've covered all the important bits and excluded the parts that don't necessarily need to be tested, or that you want to cover later. Okay, so let's get into the meat of it now. After code, I want to quickly talk about testing the data and the models themselves. Again, the best teams that I've seen in this space used to have custom scripts for testing their data artifacts, and today there's a bunch of awesome libraries. There's Great Expectations, which is more for testing data itself, not specifically for the machine learning context but for any kind of data pipeline; it usually sits with the ELT stack, actually. And there are other, more ML-specific libraries: there's Soda, there's Deepchecks, and a few others as well. But I use Great Expectations; I've been using them for almost four years now.
They've naturally grown into satisfying a lot of the requirements for machine learning use cases as well. The great thing about using a library like this, rather than writing your own scripts, is that you basically just take your dataset and, for example, here you encapsulate it using the Great Expectations abstraction. And with this you get a lot of out-of-the-box expectations. So with code we designed the assertions ourselves, but with data, regardless of the modality or the dimensionality, there are a lot of out-of-the-box expectations you can use. Let's look at a few of them, because even in our dataset here you have some categorical variables, you have some unstructured data, you can even have, let's say, floating points, etc. The notion here is called expectations, similar to assertions, and the out-of-the-box ones are really good, and they're growing as well. A few simple ones for our dataset: you want to make sure, for example, that the columns you expect in your dataset are actually there. You want to see that there are no data leaks; I think I've heard every speaker talk about this already, so it's very important. A data leak here could be as simple as making sure that every combination of a title and description is actually unique. You should probably apply this to all your data splits as well, to make sure that the same sample is not accidentally inserted into different splits. Then missing values, unique values, making sure that certain features are of a certain type, making sure that, if you have a categorical variable, for example, its values come from a predefined list and there aren't any new classes. All of these are very contextual.
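To show what these checks amount to, here is a pure-Python sketch of the kinds of expectations described above (the real Great Expectations library wraps your dataset and exposes methods such as `expect_column_values_to_be_unique`; the rows, columns, and helper names below are illustrative).

```python
# Toy dataset of project records, standing in for the course's dataset.
dataset = [
    {"title": "CNNs", "description": "conv nets", "tag": "vision"},
    {"title": "RNNs", "description": "recurrent nets", "tag": "nlp"},
]
allowed_tags = {"vision", "nlp", "tabular"}

def expect_columns(rows, columns):
    """Every expected column is present in every row."""
    return all(set(columns) <= set(row) for row in rows)

def expect_compound_unique(rows, cols):
    """No duplicated (title, description) pairs, i.e. no leaked samples."""
    seen = [tuple(row[c] for c in cols) for row in rows]
    return len(seen) == len(set(seen))

def expect_no_missing(rows, col):
    """No None or empty values in a column."""
    return all(row.get(col) not in (None, "") for row in rows)

def expect_in_set(rows, col, allowed):
    """Categorical values come from a predefined list, no new classes."""
    return all(row[col] in allowed for row in rows)

assert expect_columns(dataset, ["title", "description", "tag"])
assert expect_compound_unique(dataset, ["title", "description"])
assert expect_no_missing(dataset, "description")
assert expect_in_set(dataset, "tag", allowed_tags)
```

The value of the library over a script like this is that such checks come prebuilt, produce reports, and can be organized into reusable suites.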
The Great Expectations website has a lot of out-of-the-box tests you can use, but they also make it very easy to write your own custom tests. That becomes really useful because machine learning is completely contextual, and you will be writing your own tests as well; they provide the framework so you don't have to make it a custom script, it'll just be two or three lines of Python that you can insert. So writing these tests is great, and obviously you can run them, but they need to be organized; you can't just keep running them on an ad hoc basis, and you certainly can't expect everyone to run them every time. The best practice that I've seen is actually to organize this. Usually tests should be split into table-wide expectations and column-wide expectations; you can have other types as well, but these are the two main classes. Once you have that, you want to create something called a suite. This is an abstraction that they have, but an easy way of thinking about it is: I have a dataset, and I have a collection of expectations for it that will grow over time; you can call that a suite, and then you apply that suite to this dataset. But you may also apply it to other datasets that share the same features, so it's not always one-to-one. And a collection of suites makes a project: for this data science project, these are all the different collections of tests that I have, and those will grow over time. And a tool like Great Expectations makes it really easy to connect to different data sources.
So maybe in the beginning it's a single file, like a CSV, but eventually you're connecting directly to the database or the data warehouse. You can set all these connections up and have these suites execute every time you make a change to the code, or every time a new version of the data comes in, and you can actually automate these to run on your pipelines as well. So it's just a very powerful tool; I think a lot of the tools today around testing enable this kind of orchestration. The last thing I'll mention about this: testing is something that you do regularly, but it's hard to teach something you do cyclically in a linear fashion, so in the course I do talk about what this looks like in production. For v1, maybe you're testing on the dataset that you have right now, but that's not where most teams actually have their tests; the most mature teams put their tests at a much earlier point. By the way, with Great Expectations there's also a lot of documentation that gets generated. In production, this is actually what it looks like: instead of testing in your specific machine learning repository, you're going to want to put the bulk of the testing way upstream, because your one machine learning application at this point is not going to be the only consumer of that data. So it makes sense to have a lot of the validation happen way upstream, for example right after you extract the data from the actual source. Maybe some tests need to pass after it's ingested into a warehouse using a tool like Airbyte or Fivetran; maybe another couple of tests are executed after you apply some transformations; maybe other tests are applied after that. So you have tests applied after each stage in the ELT stack, way upstream.
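The pattern of validating after every movement of data can be sketched as follows; the stages, checks, and rows here are hypothetical stand-ins, not a specific ELT tool's API.

```python
# Hypothetical sketch: run a named list of checks after each pipeline stage,
# failing loudly at the stage where the data first goes wrong.

def validate(rows, checks, stage):
    for name, check in checks:
        if not check(rows):
            raise ValueError(f"{stage}: check failed: {name}")
    return rows

def extract():
    # Stand-in for pulling rows from the actual source.
    return [{"id": 1, "text": "Hello"}, {"id": 2, "text": "World"}]

def transform(rows):
    # Stand-in for a warehouse transformation.
    return [{**row, "text": row["text"].lower()} for row in rows]

post_extract = [
    ("non-empty", lambda rows: len(rows) > 0),
    ("ids unique", lambda rows: len({r["id"] for r in rows}) == len(rows)),
]
post_transform = [
    ("lowercased", lambda rows: all(r["text"].islower() for r in rows)),
]

rows = validate(extract(), post_extract, "extract")
rows = validate(transform(rows), post_transform, "transform")
print(len(rows), "rows passed all stage checks")
```

Downstream ML applications then inherit these guarantees and only add the few checks specific to their own use of the data.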
Now my machine learning application and your machine learning application may share the same data source, and both can benefit from the tests that have already run upstream: we can just look at the reports. Our two applications may have additional tests of their own, and those you run in your specific repository, if they don't make sense for everyone or you need certain things to look a certain way. In general, every time there's a movement or transformation of data, it's a good idea to place a test there, because some things are just not in your control, and unfortunately not everything is fully communicated. This is a great way to catch issues before they surface much further downstream. Okay, I think we just have a few more minutes, so real quick about models.

Okay, awesome. With models, I split testing into three categories: testing the training process, testing the actual model itself, and testing inference. For training, you want to check things like the shapes and values of the intermediate and final outputs of your model. You want to check, during a single batch, that the loss is actually decreasing across iterations. And here's a really good one that a lot of people don't do, even though it's very easy: just overfit on a single batch. Whatever model you're developing, you should be able to overfit it; the resulting test performance won't be great, but that's okay, you're just testing that the training logic actually drives the loss down. Then train to completion and make sure things like early stopping and the checkpoint-saving mechanism all work. And this is a big one: make sure it works on different devices. Maybe you develop small-scale locally, but your actual production runs on GPUs or TPUs; run the tests on those too. There are amazing tools coming out this year, and over the next couple of years, that make this switch of context very easy, so writing scripts or running the same tests on the cloud isn't a completely different style of work: it's becoming sort of an infinite laptop, if you will. A lot of amazing work is happening in the testing space, and in the course we apply a lot of the concepts from testing code to the modeling side as well, to make it more streamlined. Then there's the concept of testing the model itself, and this domain is huge; it depends on your model and your specific application. Our task here is NLP.
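The two single-batch checks just mentioned, loss decreasing across iterations and overfitting one batch, can be sketched as follows. A tiny NumPy logistic regression stands in for whatever model you actually train; the data and hyperparameters are arbitrary choices for the sketch.

```python
# Sketch of two training-time sanity checks: (1) the loss decreases
# across iterations on a single batch, and (2) the model can overfit
# that batch. Plain NumPy logistic regression as a stand-in model.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))              # one small batch
y = (X[:, 0] > 0).astype(float)           # learnable toy labels

w, b, lr = np.zeros(4), 0.0, 0.5
losses = []
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))                      # sigmoid
    losses.append(-np.mean(y * np.log(p + 1e-9)
                           + (1 - y) * np.log(1 - p + 1e-9)))   # log loss
    grad = p - y                                                # dL/dlogits
    w -= lr * X.T @ grad / len(y)
    b -= lr * grad.mean()

# Check 1: loss went down over the run
assert losses[-1] < losses[0]
# Check 2: we can (nearly) overfit the single batch
acc = np.mean((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y)
assert acc >= 0.9
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, batch acc {acc:.2f}")
```

In a real repository these assertions would live in your test suite and run on every change, on whatever device production uses.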
I'm sure a lot of you have heard about behavioral testing for NLP models. There are three big pillars here, and the first is invariance: certain types of changes to the input shouldn't affect the output. For example, if I change the blanked word in "Transformers applied to NLP have ___ the ML field", then for my task that shouldn't change the output. And I say "for my task" deliberately, because if your task is one where those changes should affect the output, then they aren't invariances. These are just different types of tests you write for the model itself, and notice that it doesn't matter what the model is: it could be deep learning, it could be rule-based, it could be anything. These tests are agnostic to the actual model, and they should always pass, almost like sanity checks. They can be part of your actual testing suite, so they run every time I make an update, rather than something I run manually. There's obviously a lot of adversarial testing you can do as well. And once you have these kinds of tests, again, we parameterize them so you can run them pretty easily. Once you're done with training and with the model itself, the last big section is inference. You should test that you're actually able to load the different artifacts you've created, and that you can run simple predictions, say against a REST service. We use a Makefile to orchestrate all this, but later in the course we use, for example, Airflow to make sure these things actually run, and GitHub CI/CD to enforce that they run without us manually executing the tests. Okay, so that's a quick whirlwind tour of testing.
Obviously there's a lot more that ties testing to monitoring and so on. I'm happy to answer questions, and I'll paste the links for all of this right now as well. But definitely check out the course too: it's all free, it's open source, and I keep it up to date as I work with a lot of companies across different industries and scales now, in a way that speaks to all those contexts. And everything is code as well, which I think is really important for implementing these things, so you can see what it looks like in practice. I'll stop the share now, share the resources, and start answering questions.

ML for scientific insight

Building ML models is easy; answering science questions with them is hard. This short talk will introduce common issues in applying ML, illustrated with real failures from astronomy and healthcare - including some by the speaker. We hope sharing the lessons learned from these failures will help participants build useful models in their own field.


Thanks, Jesper, and thanks, everybody, for being here. Can I just ask real quick, before we get started: give me a tick in the chat if you are a scientist, like a literal academic, doing-science scientist, and a cross if you're anything else: a product person, a data science person, someone building stuff that's actually being useful, etc. I'd be curious.

Quite a few ticks. And, looking at the chat on screen: okay, a lot of crosses. So we're mostly product people; in that case I'll frame myself a little differently. The aim of this talk, and the reason I asked that question, is to illustrate some issues I've run into in my own work trying to do science with machine learning, and how that can be uniquely difficult. A lot of the computer science and data science material we talk about focuses, in academia, on benchmarks: ImageNet, getting your paper into NeurIPS. And conferences like PyData, as the quick poll shows, tend to focus a lot on commercial applications; I'm guessing many of you are people at companies building things, and a lot less of you are doing science. So scientists often take advice aimed at computer scientists and product people, and then fall into a few traps that advice wasn't really designed around. But since most of you aren't scientists, I'll just change the words around a little and explain what's different for scientists, and maybe how some of these issues might also affect the kind of products you're building. Very quickly, who am I? My name is Mike Walmsley. I'm a researcher at the University of Manchester, an astronomer, and the lead data scientist for a project called Galaxy Zoo, which uses volunteers to classify very large numbers, millions, of galaxies. We have hundreds of thousands of volunteers contributing their effort to saying "this galaxy is smooth, this one's featured", and so on down through different questions about those galaxies. Ultimately I'm trying to answer questions about how galaxies work, so for me the model is a means to an end of answering a science question.
This talk is very much about getting to that end, and the ways the model can go wrong. I know most people here aren't astronomers, but I think these issues are quite common, so for each one that comes up I've also tried to find where it could be a problem in the case of COVID. Very briefly, these are the issues I want to talk about, arranged loosely on a scale from "this is a general ML problem you'd expect to affect everyone" to "this is really specific to the cultural context of science". I'll run through quickly, because it's a short talk and we're a little over time. The first is shortcut learning, which is essentially when the model works, but not in the way you want it to. As a quick example, here are two pieces of astronomy data from a radio telescope called CHIME. The aim of the model is to tell the difference between human-made junk, which is the one on the left, and a real signal from space, from these mysterious extragalactic sources we don't understand. The obvious thing you would like the model to learn is "this is a vertical strip of some particular shape, while this one is junk". But actually, a very good way to tell the difference is simply the standard deviation within the image: with that alone you can build an almost perfect classifier between the two. So you can solve the problem, but not in the way you actually want. As scientists, we need to get the causal effect out, because when we then try the model on new, fainter data, we need to make sure it still works. And I think that's important for product people too: if your model is working for a coincidental reason, it isn't going to generalize well to new datasets. The second issue is train-test leakage.
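The standard-deviation shortcut just described can be simulated with synthetic images. This is a toy reconstruction under assumptions (image sizes, noise levels, and the threshold are invented for the sketch), not the actual CHIME data.

```python
# Toy illustration of shortcut learning: a "classifier" that ignores
# morphology entirely and just thresholds the per-image standard
# deviation. Synthetic "junk" is noisy everywhere; synthetic "signal"
# is a quiet background plus one vertical strip, so std alone separates
# them almost perfectly -- a shortcut, not the causal feature.

import numpy as np

rng = np.random.default_rng(42)

def make_junk():                       # noisy across the whole image
    return rng.normal(0, 1.0, size=(32, 32))

def make_signal():                     # quiet background + vertical strip
    img = rng.normal(0, 0.1, size=(32, 32))
    img[:, 15] += 2.0
    return img

def std_classifier(img, threshold=0.5):
    return "junk" if img.std() > threshold else "signal"

imgs = [make_junk() for _ in range(50)] + [make_signal() for _ in range(50)]
labels = ["junk"] * 50 + ["signal"] * 50
acc = np.mean([std_classifier(im) == lab for im, lab in zip(imgs, labels)])
print(f"shortcut accuracy: {acc:.2f}")
```

The point is the failure mode: on fainter future data, where the noise statistics change, this near-perfect classifier would collapse, because it never learned the vertical-strip morphology at all.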
This is where, building on Valeria's talk, you're careful to partition up your dataset, but the real world often doesn't fit that nice row-based idea of data. For example, here are four simulated galaxies from a real astronomy paper (Ćiprijanović et al. 2020) on dividing galaxies into different classes. But they're not four different galaxies: they're four views of the same simulated galaxy. So if you do your train-test split or your cross-validation on this data, you'll be mixing up different views of the same galaxy between your train and test sets, and your model can cheat that way. This really happened in that paper, although it was fixed in follow-up work, and if you're interested in deep learning in astronomy I really recommend Ćiprijanović's work in general as a really nice example of how to do it. The third issue is around what I call begging the question. This is where you see that your model's predictions depend on some feature and you write it up in your paper: "oh yeah, this is the important thing". Or, in a product context, as a data scientist trying to understand what's driving your conversion rate, you find "oh, this thing is what's driving conversion events". But that's not what's really happening: what's happening is that you're rediscovering the biases of your model. To make that concrete: in astronomy it's generally true that spiral galaxies are pretty blue, and elliptical galaxies pretty red. That means the model can learn, based on color information alone, that if a galaxy is blue and it doesn't know what it is, it will call it a spiral.
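The leakage fix just described, keeping all views of one galaxy on the same side of the split, can be sketched as a group-aware split. This is plain Python with hypothetical galaxy IDs; scikit-learn's `GroupKFold` and `GroupShuffleSplit` implement the same idea.

```python
# Sketch of a group-aware train/test split: every view of the same
# (simulated) galaxy must land entirely in train or entirely in test,
# so the model cannot "cheat" by memorizing a galaxy seen in training.

import random

def group_split(items, group_of, test_frac=0.25, seed=0):
    """Split items so that no group straddles the train/test boundary."""
    groups = sorted({group_of(it) for it in items})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [it for it in items if group_of(it) not in test_groups]
    test = [it for it in items if group_of(it) in test_groups]
    return train, test

# four views each of four simulated galaxies (hypothetical IDs)
views = [(gal, view) for gal in "ABCD" for view in range(4)]
train, test = group_split(views, group_of=lambda v: v[0])

train_gals = {g for g, _ in train}
test_gals = {g for g, _ in test}
assert train_gals.isdisjoint(test_gals)   # no galaxy leaks across the split
print("train galaxies:", sorted(train_gals), "test galaxies:", sorted(test_gals))
```

A naive row-wise shuffle of `views` would scatter views of the same galaxy across both sets, which is exactly the leakage in the paper above.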
And then, when you look through your predictions, you say: look, the machine learning is finding all these blue spirals, spirals are super blue! Which is kind of right, but also wrong, because you biased your classifier in the first place: you're just rediscovering what your classifier has learned from the general population. This is very tricky to fix, but it's mostly about controlling the information you pass to the model. For example, we might pass only grayscale images here, or shuffle the color channels. The last issue I want to highlight is what I call sandcastles, and this is really a cultural thing, not a code problem. In astronomy we tend to make many, many models for classifying galaxies, and in COVID there's a lovely review by Wynants et al. that found some 500-odd models produced for essentially diagnosis or prognosis of COVID. It turns out the vast majority of these were never validated independently. People were just throwing out hundreds of models, and they were useless. It's largely considered a massive failure of our industry that, despite big promises, we were not very helpful, broadly speaking, in addressing COVID. The reason for this, and I think this might be controversial alongside Jesper's talk, is that there is very minimal short-term motivation for doing reproducible, careful, validated science: I need to get papers out, I need my grant application funded, I need my conference talk, I need to keep practically progressing my science. I don't really have time to carefully benchmark everything or make it a nice Docker package. So my advice, apart from trying to take the slightly longer view for the reasons Jesper gave, is just to be honest about what you're trying to do.
If you're a scientist and you want to be a data scientist, that's okay: write data science papers. But if you're genuinely trying to push the limits of human knowledge, then maybe the best thing in this imperfect world is to use a few simple tools: put it on GitHub, make it runnable in a Jupyter notebook, list your requirements, and so on. There's some really low-hanging fruit that Jesper went through, and that other speakers will go through shortly, that might help you get there. So that was a really quick run through a few issues I've found in my work doing science with machine learning. I think they're relevant to people building products too, people who are ultimately trying to find the answers behind why the models work, because that, in a practical sense, is science. Thank you for listening.

Evaluating Machine Learning Models

In this talk, we will introduce the main features of a Machine Learning (ML) experiment. In the first part, we will dive into the benefits and pitfalls of common evaluation metrics (e.g., accuracy vs. F1 score), while the second part will focus on designing reproducible and (statistically) robust evaluation pipelines.

The main lessons learnt and takeaway messages from the talk will be showcased in an interactive tutorial, with applications in biomedicine.



That was a great introduction, with lots of threads to follow on. Please let me know if you can't hear me well; I'm going to share my screen in the meantime, and I'll also try to monitor the chat, so if there's any burning question, feel free to raise your hand, no worries at all. Right. I'm going to time myself, because timing is always of the essence here. This part is about evaluating machine learning models, and in this presentation I'd like to give you a little introduction to the basic components you need in order to evaluate your machine learning models. Let's start from the very beginning with some terminology, so we understand each other; this is basic if you're familiar with machine learning. We have some domain objects from which we extract features, the relevant information we use in the model; we also call this the data. We have a task to solve, and the output is the prediction generated by the model. The model is an instance of a learning algorithm, and the combination of learning algorithm and model instance is the learning problem we're trying to solve. Depending on how training is performed, you can have supervised learning or unsupervised learning; for this talk we will focus on supervised learning, without any loss of generality. I have essentially two main objectives with this talk. The first is to describe the basic components required to carry out a machine learning experiment, and we'll see in a moment what I mean by that.
What I'd like to give you is an overview of the basic components, not just recipes, so that you understand the very basics. The second goal is to give you some appreciation of the importance of choosing your measurements appropriately. In other words, just measuring accuracy, which seems to be the most used metric, might not be a good idea; it might not be the right measure to use. It's not always the case, and we'll try to understand why. In a machine learning experiment we have to deal with different elements: the research question, which is what we're trying to achieve; the learning algorithms; the models, which are instances of those algorithms; and the data, or datasets in general. I've tried to play a little with the colors on the slide, so let me know if that's not clear at all, and feel free to interrupt me. Classic examples of research questions might be: how does model M perform on data from domain D? Or, a rather different question: how would model M perform on data from a different domain? That's a completely different question to ask. Or: which of these models, different instances of the same algorithm, has the best performance on data from D? Or: which of these learning algorithms gives the best model on data from D? These are pretty standard questions we end up asking ourselves when we set up our machine learning experiments. If you're wondering about the difference between the last two questions, the second-to-last is essentially about how I play with hyperparameters to understand which instance of the learning algorithm I want to use on my data.
The last question, on the other hand, asks which is the best model on my data; there's an explanation of why we want that distinction in the very last slide. To set up our experimental framework, we need to investigate two main questions. There are actually three, but in this talk we'll cover the main two. The first is what we need to measure, and the second is how we measure it. The last question is how we interpret the results, and that will be the next step; we don't have time to cover it today. In essence, that last question asks how robust and reliable your results are, so that you can trust them, even from a statistical point of view. So let's start with what to measure, and I want to start from the very basics, where everything starts: the so-called confusion matrix. For those of you not familiar with it, let me introduce its basic components. These slides are going to be a little dense, but bear with me and let me know if anything is unclear. We'll consider a binary classification problem, here and for the rest of the slides; again with no loss of generality, but it makes things easier to understand, in my opinion. In clockwise order, the matrix is composed of four components. The true positives (TP) are the positive samples correctly predicted as positive; we have two classes in general, positive and negative. The false negatives (FN) are the samples wrongly predicted as negative when they should have been positive. Their sum is the condition positive, P.
P is the total number of real positive samples, i.e., the size of the positive class. The true negatives (TN) are the negative samples correctly predicted as negative, and the false positives (FP) are the negative samples wrongly predicted as positive; their sum is the condition negative, N, the total number of negative samples in our dataset. The sum of everything in the matrix is T, so P + N = T, the total number of samples in the dataset. These four basic components are what lots of different metrics are built upon and derived from. First of all, we have the proportion of positives, pos = P / T, and the proportion of negatives, neg = 1 − pos = N / T. Now things start to get interesting, in my opinion. We have the primary metrics, from which all the other metrics we're going to talk about are derived. The true positive rate, also known as sensitivity or recall (over the years people have used different names for the same thing; whether it's called recall or sensitivity depends on the context, machine learning or information retrieval, but they refer to the same quantity), is the rate of true positive cases: TP / P. I've tried to highlight in the matrix the cells we're considering each time. The true negative rate is TN / N. And the confidence, also known as precision from information retrieval, is TP / (TP + FP). Now we get to the secondary metrics. We're not going to cover all of them, just two or three.
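The primary metrics above can be written out directly from the four confusion-matrix components. The counts below are hypothetical, just to exercise the definitions.

```python
# The primary metrics defined above, computed from the four
# confusion-matrix components of a binary classification problem.

def primary_metrics(tp, fn, tn, fp):
    p = tp + fn            # condition positive: all real positives
    n = tn + fp            # condition negative: all real negatives
    return {
        "tpr": tp / p,                 # true positive rate / sensitivity / recall
        "tnr": tn / n,                 # true negative rate
        "precision": tp / (tp + fp),   # confidence / precision
        "pos": p / (p + n),            # proportion of positives
    }

# hypothetical counts: 100 samples, 80 positive, 20 negative
m = primary_metrics(tp=60, fn=20, tn=20, fp=0)
print(m)  # tpr 0.75, tnr 1.0, precision 1.0, pos 0.8
```

Keeping these raw counts (or the primary metrics) around is what lets you derive any secondary metric afterwards, which is the closing takeaway of this part of the talk.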
And we're going to see the difference between them, because one of the main points of this part is: are those metrics really all the same? Does it really matter which metric we choose, or, better worded, what's the difference between choosing one metric over another? The first secondary metric, the F1 score, is the harmonic mean of precision and recall. Another, super popular metric that lots of people tend to use in evaluation is accuracy: looking at the confusion matrix, it's the ratio of the good predictions, TP + TN, over the total number of positives and negatives. And then there's this one; bear with me another minute, because this is the most important metric, the one I want you to use. It's on the less famous end of the spectrum, but in my opinion and experience it's one of the most effective: the Matthews correlation coefficient (MCC). Differently from the other secondary metrics, it considers all the possible components of the confusion matrix: MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Let's analyze the components of this formulation: in the numerator we have the good predictions minus the bad ones, and in the denominator the good, the bad, and the ugly, all the possible combinations of good and bad. That's the idea of the Matthews correlation coefficient; we'll come back to it later, but that's the metric I want you to work with. So let's start by asking: is accuracy a good idea?
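The three secondary metrics just introduced can be sketched as follows; the counts are hypothetical, chosen so the classifier is strong on positives and weak on negatives.

```python
# The secondary metrics from the talk, derived from the four components:
# accuracy, F1 (harmonic mean of precision and recall), and the Matthews
# correlation coefficient, which uses ALL four cells of the matrix.

import math

def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)

def f1(tp, fn, tn, fp):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, fn, tn, fp):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else float("nan")

# hypothetical counts: strong on positives, weak on negatives
tp, fn, tn, fp = 90, 0, 2, 8
print(f"acc={accuracy(tp, fn, tn, fp):.2f} "
      f"f1={f1(tp, fn, tn, fp):.2f} "
      f"mcc={mcc(tp, fn, tn, fp):.2f}")
```

Note how accuracy and F1 come out high while MCC stays modest: it is the only one of the three penalized by the weak negative class, which previews the examples that follow.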
In general, the data we use for evaluation (the formulation of accuracy is at the top of the slide) is treated as representative of any future data. Imagine someone gives you data to evaluate a model; I haven't told you yet where this data comes from, which should in general be the case. Nonetheless, the model may need to operate in different operating contexts, and a different operating context might mean, for example, a different class distribution. If we treat the accuracy on future data as a random variable (bear with me, this is the only slightly more technical part of these slides) and take its expectation, assuming in the simplest case a uniform probability distribution over the proportion of positives, we end up with a better measure: what we call the average recall, which is half the true positive rate plus half the true negative rate. What this all means is that if we don't know the future operating context and we choose accuracy, we're essentially assuming that the data we're using to evaluate the model has the same distribution as the data we're going to see in the future. That's a huge assumption if we don't know anything at all; average recall is probably the better choice in that case, when we know nothing about the future distribution. Let me give you an example: the same dataset, two different models. On the left-hand side you have this confusion matrix.
We have 100 samples in total: 80 positive, 20 negative. The first model, M1, produces the results you see in the confusion matrix: the accuracy is 0.8 and the average recall is 0.88. On the right-hand side, the second model, M2, on the same data, produces 85% accuracy, so compared to M1 it seems a little better. But if you look at the average recall, we get 0.72. So if we only looked at accuracy we'd lean towards M2 on this data, but if we look carefully at the results, it's actually M1 that does the better job of predicting the negative class. This is one of the main issues with accuracy: we're effectively only considering the positive class, and the average recall helps us reach this point and better understand what's going on. So is accuracy a good idea? Probably not in this particular case. Now let's have a look at the F1 measure. Recall that F1 is simply the harmonic mean of precision and recall. On the left-hand side we have model M2 performing on D. In this example we're going to switch the dataset, not the model: we fix the model and change the data. What I want to show you is that, taking the results from this confusion matrix, F1 is 0.91 and accuracy is 0.85, the same results we got from the previous example. But if we move to a different dataset, where instead of 100 samples we have 1000, and in particular lots of negatives that we didn't have before, the accuracy goes through the roof, 0.99, whereas the F1 measure stays at 0.91, essentially the same.
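The two comparisons just narrated can be sketched numerically. The confusion-matrix counts below are hypothetical, chosen only to be shaped like the slides' scenarios (accuracy prefers M2 while average recall prefers M1; padding a dataset with easy negatives inflates accuracy but not F1), so the printed values won't all match the slide figures exactly.

```python
# Sketch of the two experiments above: (1) accuracy vs. average recall
# for two models on the same data, and (2) accuracy vs. F1 when the
# same model is evaluated on a dataset padded with extra true negatives.

def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)

def average_recall(tp, fn, tn, fp):
    return 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp)

def f1(tp, fn, tn, fp):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# experiment 1: two models, same 100-sample dataset (hypothetical counts)
m1 = dict(tp=60, fn=20, tn=20, fp=0)    # careful on negatives
m2 = dict(tp=80, fn=0, tn=5, fp=15)     # mostly ignores negatives
for name, m in [("M1", m1), ("M2", m2)]:
    print(f"{name}: acc={accuracy(**m):.2f} avg_recall={average_recall(**m):.2f}")

# experiment 2: same model, dataset padded with 900 extra true negatives
small = dict(tp=80, fn=5, tn=5, fp=10)
large = dict(tp=80, fn=5, tn=905, fp=10)
for name, m in [("small", small), ("large", large)]:
    print(f"{name}: acc={accuracy(**m):.2f} f1={f1(**m):.2f}")
```

Accuracy rewards M2 and rises with the padded negatives; average recall exposes M2's weak negative class, and F1, which never looks at true negatives, stays put.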
So what lesson can we derive from this? F1 is to be preferred in domains where negatives abound and, most importantly, negatives are not the relevant class we want to classify. The takeaway from the first example was that accuracy only looks at the correct predictions and doesn't properly account for the negative class. In this example we have lots of negatives, but the negatives are not the interesting bit: in this particular classification problem we want to do better on the positive ones. If we looked only at accuracy we'd definitely go with the larger figure, but we should say that 0.91, not 0.99, is the right measure here. And let me clarify what an interesting case for the negatives would be. Consider that precision, recall, and F1 all come from information retrieval, where you're interested in the right results returned by a search engine, not the negative ones. That's typically the case in which negatives are not relevant at all: you're not interested in the wrong results, only in the interesting results your search engine query returns. Okay, so what's the case in which this doesn't hold, where we do want to account for all the possible classes? Let's look at this example. Again we have a confusion matrix with 100 samples, and this is a real corner case, but it can happen much of the time if we don't account for the class distribution properly: this classifier, M2, is doing a brilliant job classifying the positives and a terrible job classifying the negatives. We have no samples predicted as negative at all.
When it comes to F1, we get 0.97; the accuracy is 0.98; and the MCC is undefined. That tells us something is wrong here: the classifier cannot actually predict one of the classes at all, so its performance on that class simply cannot be estimated. In this other case we do slightly better on the negatives, but there's still confusion about what's going on: we're doing a good job on the positives and a decent job on the negatives, but if you do the counting we're still making errors on the negatives. Still, F1 and accuracy are through the roof, very high, so you might be tempted to say your model is doing pretty well. But the thing is, we're getting the negative class wrong, and the negative class counts in this particular case; the errors on the negatives must be weighed differently, because that's the less represented class. And so the MCC tells you: well, it's only 0.14, this is not going anywhere. I should mention that the MCC, differently from the other metrics, doesn't go from zero to one but from minus one to one, and you have to read it this way: minus one means you're classifying the classes in a completely opposite way, positive for negative and the other way around; zero means random prediction; and one means good prediction. Takeaway message: MCC is to be preferred, in general, when predictions on all classes count. So the takeaway lessons are essentially the ones we've already seen, but the main message I want to convey is this: when you run an experiment, don't just record accuracy; keep track of the primary metrics, and the secondary metrics can be derived afterwards. Okay, so the second question is how to measure things. We focused in the first part on the metrics side.
Let's focus on how we prepare the data to do that. In evaluating supervised learning models, and machine learning models in general, the whole process might seem quite straightforward: you train the model and calculate how the model is doing on the metric you have chosen. Well, actually, that is not the right way of doing it, because it would give an overly optimistic estimate, also known as the in-sample error. Your goal is not to evaluate the model on the data you have right now in the training set; you want to evaluate your model on data the model has never seen before, or in other words, data the model will see in the future. That is technically called the out-of-sample error. So how do we do that? One simple and canonical way of doing it is the holdout evaluation. You take your whole dataset and you split it into a partition. Partition means there's no overlap between the samples in one set and the other. You have the training set and the test set, also known as the holdout set. An important message here, and I'm going to repeat this quite often later today: the test set has to be put aside. The test set is something you generate at the very beginning of your pipeline. You start by saying, okay, this is my data, I want to take away some part which I will be using later. The test set has to be taken out, and you have to forget about it. It has to be taken back only when you're doing the actual evaluation. From then on, you do everything on the training set, and only the evaluation, the very last step, on the test set. So again, the test set is only for performance evaluation, nothing else. To do that, we can simply split into train and test with a simple Python function.
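A minimal sketch of the holdout split, using the iris dataset as a stand-in; `random_state` makes the split reproducible and `stratify` keeps the class proportions (more on stratification below).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve out the holdout set once, at the very beginning, then forget
# about it until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

Everything downstream, preprocessing, feature selection, training, happens on `X_train` only; `X_test` comes back a single time, at the end.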
There's a utility function from scikit-learn, train_test_split, that does the job, and that's it. The problem is that this kind of train/test split works, but it's a sort of weak way of evaluating performance, because it's highly dependent on how you select the samples in the test partition. And remember, we're considering this test partition as representative of future data: we say, okay, this is the data we want to evaluate the model on, and this is going to be a good test because the model in the future is going to see data which resembles what I'm using right now in the test partition. So a better way to do this is a slightly more sophisticated partitioning process called cross-validation. The idea is that we generate several test partitions and use them to assess the model. We have the dataset, which we randomly split into K partitions, also known as folds, and then we use these folds in turn. So for k times, we fit the model on the k minus one partitions, the blue ones you see, and the remaining pink one is used as the evaluation partition, the test or validation partition. We repeat this process k times, on k different models. This is the important bit to remember: every time, you have a brand new model, trained on a different partition. Then you average the metric you chose in the beginning over the different folds, and you have some indication of how the model is performing on different versions of the training and test partitions. Remember, the deal is always the same: the test folds must remain unseen by the model during training.
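The procedure just described can be sketched with scikit-learn's `cross_val_score`, which clones a fresh model for every fold, exactly the "brand new model each time" point; the dataset and model here are illustrative stand-ins.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: five fresh models, each evaluated on the one fold it never saw.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation across folds gives a much more honest picture than a single holdout number.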
Once you generate the test partition, you have to use it only at the end to evaluate the model; whichever feature selection or processing you want to do, you have to do it on the training set and then apply it to the test set. K can be any number; typical values in the literature are five or ten, and if k is equal to n, the total number of samples, this is also known as leave-one-out cross-validation. Cross-validation can be repeated, which means you can change the seed and repeat the process multiple times, although you're increasingly violating the assumption that your samples are independent and identically distributed. And cross-validation can be stratified, which means you maintain the same class distribution among training and test folds; you want this when you have an imbalanced dataset, or if you expect the learning algorithm to be sensitive to the class distribution. In scikit-learn you have a long list of different methods you might choose for cross-validation, GroupKFold, ShuffleSplit and many others, so you don't have to reinvent the wheel. And I want to clarify one last bit, because there's a common mistake where people use cross-validation test folds to do model selection, also known as hyperparameter selection. This is methodologically wrong, as parameter tuning should be part of the training, so test data shouldn't be used at all in hyperparameter tuning. What you can do instead is this: you do your partitioning, you generate your test set, then again you forget about it. Then you take the training set and split it into training and validation, you do your hyperparameter tuning on this new training set, and you assess those hyperparameter values on the validation set.
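The tuning protocol just described can be sketched like this: the test set is carved out first and untouched, and the hyperparameter search runs only inside the training data (here GridSearchCV's internal cross-validation plays the role of the train/validation split; the dataset, model and grid are illustrative choices, not the ones from the talk).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: take the test set out first, and forget about it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: tune hyperparameters using ONLY the training data; the internal
# CV acts as the train/validation split described above.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Step 3: only now does the test set come back, for the final assessment.
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

With `refit=True` (the default), GridSearchCV also retrains the best configuration on the full training set before the final scoring, which matches the methodologically sound recipe described next.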
And once you're done, you retrain everything on the full training set you had in the beginning, and then you assess the model performance on the test set. That's a methodologically sound way of doing hyperparameter tuning on actual data. The last thing I want to mention here is the no free lunch theorem. It was proven a long time ago that for some data one model works better, and for some other data there are other models working better than the previous one. So there exists no unique model which can rule them all. You have to try multiple models; as Jesper was saying before, random forest is a good baseline, and you don't have to shoot with the cannon of deep learning if a simple linear model can do the job. That's why you sometimes need cross-validation: cross-validation provides you a robust framework to make this choice. Once you have your experimental framework set up, you can essentially swap in whichever model you want, repeat the same process, and use cross-validation to see how these different models perform on different versions of the train, test and validation datasets. So this is the link to the repository. In the interest of time, I'm almost running out, so I just want to show you the basics of what we've done in the repository. You can find different notebooks there, one for each of the steps Jesper mentioned. I'm going to focus very quickly on the first two, but mostly on this one, on model evaluation. In the notebooks we're going to use this dataset, which is a sort of toy case: the penguins dataset, a classification problem.
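A small sketch of the "no free lunch" workflow: with the cross-validation protocol fixed, candidate models can be swapped in and compared under identical conditions (the dataset and the two candidates are stand-ins for whatever is appropriate in your domain).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# No free lunch: try several models under the exact same CV protocol.
candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in candidates.items()}
print(results)
```

If the simple linear model matches the fancier one, the comparison itself is the argument for keeping the simple model.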
We have some features, and we want to classify the different penguin classes. Backing up to the first notebook, data preparation: we look at the data and do some analysis on the features; we can do that because we have a limited number of features. Afterwards we clean the data, removing some NaNs from the dataset which are not particularly useful. Then we start our preparation by doing some feature preprocessing. The only takeaway message from this part: whenever you want to do any preprocessing, always remember train/test split first, and then you apply the feature processing. That is, you don't preprocess your test set; you apply the result of the preprocessing fitted on the training set to the test set. That's a different thing. Then we're ready to train our model, and in the evaluation notebook we're going to see examples of data splitting and then cross-validation. Stratification is one important aspect of data splitting: when you have an uneven class distribution, when splitting into training and test you want to keep the same class distribution in both the training and the test sets. In other words, when you're sampling randomly during the splitting, you're not sampling from your whole dataset but you're sampling within the class buckets. This is better for the stability of the future predictions. And this is exactly the case here: we have one class which is less represented, so if we just randomly split the data, we might end up with no samples of it at all in the test set, which is not robust, because we want to evaluate on all the possible classes we have. But probably the most interesting bit for me is this model evaluation part.
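The "split first, then preprocess" rule can be sketched like this (the penguins dataset isn't bundled with scikit-learn, so this sketch uses the wine dataset as a stand-in): the scaler is fitted on the training portion only, and the same training statistics are then applied to the test portion.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on the training data ONLY
X_test_s = scaler.transform(X_test)        # reuse the training mean/std here

# The training part is centered exactly; the test part only approximately,
# because it was scaled with statistics it did not contribute to.
print(X_train_s.mean(axis=0).round(6))
```

Calling `fit_transform` on the test set instead would leak information from the test distribution into the pipeline, exactly the mistake the talk warns against.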

Oh, brilliant. Okay, so I'll stop here and let's go to the questions. We can take them now or at the end, as you prefer. We could probably do them at the end; I will keep track of them, so I don't steal time.

That's totally fine. I just want to say that if you look at the notebook, we had a quick joke about it, because since we're using a toy dataset, I had to find a way to break it to prove the point I was trying to make. In essence, in the "choosing the appropriate evaluation metric" section of the notebook, what I did was choose two classes, the most represented and the least represented, so that we have just two classes again, a binary classification, to make it simple; not because there's any limitation in the metrics, just to make it simple. And then I chose the features which were the most confusing. I looked at the different features we had and said, look, if we do classification choosing this combination of two features, the flipper length and the culmen depth, it might be quite hard for a model to figure out which of the two classes a sample belongs to. So what I did was essentially prove the point that with this particular setting of the data, the same data just filtered a little bit, if you run cross-validation and then evaluation with both the accuracy and the MCC, you end up with very different results. Accuracy, remember, just considers the good predictions, not the bad ones, as the MCC does. If you look at the average accuracy in the cross-validation, you see 0.70, which seems reasonable, but it's definitely not, because we're making lots of errors on the negatives here, and this is something we should spot when evaluating our model. So yeah, that was the point, but you'll find the details in the notebooks, which I'm more than happy to go through afterwards on Discord later today, not now. Let me have a look at any other questions.

Why and how make ML reproducible?


The overview talk serves to set the scene and presents different areas where researchers can increase the quality of their research artefacts that use ML. These increases in quality are achieved by using existing solutions that minimize the impact these methods have on researcher productivity.

This talk loosely covers the topics Jesper discussed in their EuroSciPy tutorial, which will be used for the interactive session here:

Topics covered:

  1. Why make it reproducible?
  2. Model Evaluation
  3. Benchmarking
  4. Model Sharing
  5. Testing ML Code
  6. Interpretability
  7. Ablation Studies

These topics are used as examples of “easy wins” researchers can implement to disproportionately improve the quality of their research output with minimal additional work using existing libraries and reusable code snippets.


Why and how do we make machine learning reproducible, especially for scientists? Well, we can make a huge impact with machine learning. Essentially, we have the ability to make things faster and to do things that were never possible before. Right now everyone is wild about Stable Diffusion, where we can write a sentence and suddenly images are generated. And of course, every scientific discipline has machine learning workshops at the moment that are just like, oh yeah, you can do this, and you can do this faster, and these revolutionary insights are possible now, like the protein folding that has been done. So we can really do things that were never possible before, which gives us, well, great power but also great responsibility. And of course we're doing science; these are scientific results, and in theory, scientific results should be reproducible. The reproducibility crisis has been ongoing since before machine learning, but machine learning really has the power to exacerbate it and make it worse, and we don't want that. But if "we're scientists, we should do this" were the only reason, then we wouldn't have this workshop, or a lot of other people talking about this issue. So what are other reasons to make our work reproducible? Well, human progress. If we build our model and do this cool thing, and people can take this model and build upon it, that means we are now progressing much faster, because they don't have to figure out how to do it from an abstract paper. I think everyone here who has been working in science or machine learning has seen those papers that are just a vague description, which often is just a nice narrative around how this model could have been built: no access to data, no access to the weights, no access to the training scripts. That's a nightmare to reproduce. And of course, ethical work.
We have to check that our work is ethical and doesn't negatively impact anyone involved. Reproducibility really helps there, because it's adjacent to open source work, where we can really have a look at what the training data is, how these models were conceived, and what design choices were made in them. And of course, funding bodies will want this more and more, especially if you have a black-box model and you say, oh yeah, I found this scientific insight, and that's it. A lot of funding bodies aren't that happy with that anymore, because there's a divide between the engineering and the progress, and most funding bodies want to fund progress in science, not just one thing that you can then build an API around. So that requirement is appearing more and more. And of course, ease of reuse: other people can use your work much more easily if you make it reproducible. This can build companies, this can build more scientific impact, like I said before with human progress, but it's also a gift to ourselves. Because if you've ever gone back to your own code after just a year and looked at it, well, a lot of people despair. I have despaired, before I got better at these practices, at really building better models and writing better code to make it reproducible. Machine learning done badly, like bad science, unfortunately hurts people. So we have to work to make these models as transparent as possible, and make them reproducible, so people can check the work but also use the work and go on with it. That's big words, and I myself have struggled with this a lot: how do you actually do this? Well, we can start with model evaluation. I won't go too deep into this, to not step on Valerio's toes, but essentially we have to ensure we have valid results.
If we don't do proper evaluation, for example if we just train on the entire dataset, overfit the model to it, and write a paper based on that, we now have a paper that hopefully doesn't get accepted, because reviewers should reject it. That paper has taken a lot of work, obviously, so if it gets rejected, that's sad for us, because we put in all the work. But if it got accepted, that is now a scientific result reporting, oh yeah, we can do machine learning on this, on something that actually wasn't possible, because the overfit model doesn't generalize to new data; it only works on this dataset, this subset of the real world that we're looking at. So we have to reject this to be able to work with actual insights. And with model evaluation we can also reduce the impact of dependence in the data. I have a nice image of a satellite here: if you use satellite data, it's dependent spatially, because these are essentially maps, and it is dependent in time. We have to account for that in model evaluation. And of course, address the real-world class distribution: rarely do we actually have homogeneous classes in our data. But that's enough about that. So how do we do this? Well, Valerio will talk about it, and I wrote a mini ebook that you can get on my website; it's free if you sign up to my newsletter. Essentially I go through all of this in one page for each of those topics. And of course we have our GitHub repo where all of that is documented with code; Valerio has put in a lot of work to make these Jupyter notebooks really high impact. So the next one is benchmarking. For collaboration, this is really important.

We can use dummy classifiers as the baseline, and use domain solutions. In weather, for example, we have WeatherBench, where we have domain-specific solutions and can compare against each other. That is part of reproducibility: we can actually compare results on a shared set of data, where we can define the metrics and really say, okay, your new insight should be able to work on this benchmark dataset. And of course, simple models are really important: compare your model to something that isn't a deep neural network. There is this paper in seismology where a huge deep neural network was published in Nature to predict something about earthquakes, I think the aftershocks, but the problem was that you could do the same with a linear regression if you just take the important features. So we really have to use benchmarking to be able to compare. Domain solutions also means: use solutions that already exist in the domain. Right now we're working on a paper where we have a really cool transformer doing a thing in weather forecasting, and of course we have to compare that to existing solutions. There's a member-by-member solution that basically shifts all the weather forecasts individually, and that is a very good, very advanced approach. If we compare our machine learning model to that, then we can actually say, all right, we're doing some good work here. So not just the super simple models, but also domain solutions that work. And of course, yeah, random forests. Random forests are my usual baseline model, because they're nonlinear; they're usually a little bit better than linear models. But if a linear model works on your dataset, you certainly don't need a fancy, big model. So, model sharing. Really important; this is also part of later talks, as far as I know.
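The dummy-classifier baseline mentioned above can be sketched in a few lines (the dataset here is an illustrative stand-in): any real model has to beat this score before it is worth talking about.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A classifier that always predicts the majority class: the floor that
# every real model must clear.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)
print(baseline_score)
```

On an imbalanced dataset this floor can be surprisingly high, which is exactly why reporting it next to your model's score keeps everyone honest.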
But yeah, you have to export your model and share model checkpoints, and fix the sources of randomness; this is really important. PyTorch, TensorFlow and scikit-learn all have a specific webpage about fixing all the sources where stochasticity comes in. It can be a little bit tricky, because you have to fix it for PyTorch and then for NumPy, and maybe even for Python, depending on which version you're on. So go back to the documentation for this. Then linting and formatting: use automatic linters and formatters. I'm a fan of black; I know some people don't like it, but I think there are also forks like gray and blue that take some liberties with the black formatter. So you can make your code look really nice and standardized according to the Python style standard, and that really helps in sharing your code with others. Even if your code isn't the best-written, because we're all scientists or researchers, so our code isn't up to the standard of a computer scientist who has written C their entire life, we can at least make it easy for our colleagues to read the code and then work through it. And yeah, docstrings are a really nice way to document what you're actually doing, and you can build documentation out of docstrings, so use those. And if you work in VS Code, which I love, you can install an extension that basically writes most of the docstring for you if you have type hints: it goes through your function signature and writes most of the docstring, so you just have to write the description of the actual inputs and outputs. Makes life really easy. And yeah, I worked with a developer who came from Java and hated writing docstrings.
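A minimal sketch of seeding the common sources of randomness; the `set_seed` helper is a hypothetical convenience, and as noted above, frameworks like PyTorch or TensorFlow need their own additional seeding calls, per their reproducibility documentation.

```python
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness in pure Python and NumPy.

    PyTorch / TensorFlow require their own seeding on top of this;
    see each framework's reproducibility docs.
    """
    random.seed(seed)
    np.random.seed(seed)

set_seed(0)
a = np.random.rand(3)
set_seed(0)
b = np.random.rand(3)
print(np.allclose(a, b))  # True: same seed, same numbers
```

Calling one helper at the top of every script or notebook makes re-running an experiment yield the same numbers, which is the whole point of the exercise.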
But when I showed him that extension, and another one that actually uses a machine learning model to infer what the variables might be, docstring writing became really easy, and your colleagues will love you for it, because even if they use Jupyter notebooks, they can just Shift-Tab into your function and see what it's doing. Yeah, of course, fix your dependencies. This was actually funny: when Valerio did the first pull request on what I had written, he fixed all my requirements files, because I had been lazy and just put the name of the package in there, and now we actually have versions in there, which is a much better way to do it. So start at least with sharing requirements files or an environment.yml, but if you can pin the versions, it's even better, especially in machine learning, because I have seen research where people fixed all the sources of randomness and just ran different versions of a GBM library, and got different results just because the version differed. So fixing the dependencies is really important for model sharing as well. And of course, the gold standard is Docker, which can be a little frightening, but we do have an example Dockerfile in our GitHub which doesn't have to be that hard, to be honest, so you can check it out. And there are also people trying to make even that easier. And then of course we want tests, but we have a talk on that, so I'll just go through it quickly: fix all your inputs and check that your methods work deterministically, take some data samples for tests, write docstrings, because you can put examples in docstrings and then run doctests, which is really neat; that's the easiest way to write your first test. And of course, validate your inputs. And yeah, interpretability is a two-edged sword, in a way.
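The doctest idea mentioned above in one small sketch (the conversion function is a made-up example): the usage example in the docstring doubles as an executable test.

```python
import doctest

def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius.

    The examples below double as tests that ``python -m doctest`` can run:

    >>> fahrenheit_to_celsius(212.0)
    100.0
    >>> fahrenheit_to_celsius(32.0)
    0.0
    """
    return (temp_f - 32.0) * 5.0 / 9.0

# Run every docstring example in this module as a test.
print(doctest.testmod())  # TestResults(failed=0, attempted=2)
```

This is essentially free documentation and a free regression test in one, and Shift-Tab in a notebook shows the examples to your colleagues as well.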
There are some interpretability methods that have fallen out of favor a little bit, because they only interpret one sample. They're nice for publications, because you can find the sample where your network looks the nicest and say, oh yeah, this is explainable AI, but it only explains that one sample that has been run through your model. But there are tools that are good. Scikit-learn has some of them, like permutation importance and partial dependence plots, which really explain how important the features are; that already is an interpretability tool, and I've used it a lot in my work with domain scientists, to really try to build something together and show them how the model interprets their data. That is a good sense check, to see if the model puts importance on the features the domain expertise says are important. Then you can use SHAP, but I have a love-hate relationship with SHAP at the moment. If you're on Discord after Ian Oswald's talk, I would say it's low-key unusable right now; there are a lot of problems with a dependency, and there's only one developer on it at the moment. If you can make it work, it's amazing, but right now it's not working out of the box, which is sad, because it's such an amazing tool and it used to be so useful. When that comes back, definitely use it. Use model inspection itself, so see what is going on in your models. And yeah, then communicate results: make sure you build good rapport with domain scientists so they can actually understand what you're doing. No one is actually interested in your accuracy number, because you can also use it to cheat; 99% accuracy doesn't mean anything if you don't know the data. So be sure to build trust, essentially.
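The permutation importance mentioned above, sketched with an illustrative dataset and model: each feature is shuffled in turn, and the resulting score drop measures how much the model actually relies on it, a global, model-agnostic view rather than a single-sample explanation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature and measure how much the held-out score drops.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
top = data.feature_names[result.importances_mean.argmax()]
print(top)  # the feature the model leans on most
```

Showing a ranking like this to a domain scientist is exactly the sense check described above: do the features the model leans on match the ones the domain expertise says matter?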
So you want to build that intuition about the model and about the data, and build good rapport with domain scientists. And then the last one, which I am quite fond of, is ablation studies, especially if you publish your models. We all know data science is very iterative: you try this, it doesn't work, then you try that, it doesn't work, so we follow this optimization path, but we don't know if the result is the optimum, if it's the best model overall. So what we actually do is remove components from our solution. In the notebook I've been cheating a little bit, because we have such a toy model that I didn't build a deep neural network where you can switch off, say, the probabilistic part or some other part. So I just switched off the standardization to show that the standardization was really important. It's a mini ablation study, it's kind of cheated, but it shows an example of switching off different components in your machine learning pipeline, in your workflow, so you can actually argue that each component was important to include and that it's not spurious in the end. And yeah, that shows the impact of your final pipeline. So where do we find all this? Well, on our website, the .xyz one, where you can get to the GitHub as well. It should also be linked on the page. And I never got your name, our moderator; sorry about that.
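The mini ablation described above can be sketched like this (dataset and model are illustrative stand-ins): two pipelines that are identical except for the standardization step, compared under the same cross-validation protocol, so the score gap is attributable to the removed component.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

full = make_pipeline(StandardScaler(), KNeighborsClassifier())
ablated = make_pipeline(KNeighborsClassifier())  # standardization switched off

# Everything else is identical, so any score difference is due to the
# component we removed.
scores = {}
for name, pipe in [("with scaling", full), ("without scaling", ablated)]:
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(name, round(scores[name], 3))
```

For a distance-based model like k-NN the ablated pipeline scores noticeably worse, which is precisely the evidence that standardization earns its place in the final pipeline.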

Opening Talk

Opening the workshop


Yeah, thank you for showing up. This is really cool, seeing so many people in here. I think we put together some quite nice talks today. So let me start with an introduction.

Who's going to be speaking today, and what's the general schedule we want to keep? I have a very short presentation for this.

I know this is shifting every screen around. Sorry about that.

Alright, so basically, we have a workshop website for this, because the .xyz domains are incredibly cheap. You can check all our talk material there, and there's also the link to the GitHub, which is way too long because it has my name in it, and no one can spell my name, not even myself sometimes. So there's the link right when you go to the website, and of course all the information about PyData, and the speakers as well.

And speaking of speakers, we have some really cool people coming. Goku is the creator of Made With ML, which is one of my favorite resources in MLOps, and just a great resource for anything that goes beyond training your model. It's really fantastic; check it out if you haven't yet. Whenever I share it in my newsletter, people go wild for it, because it's such a good resource. He will join a little bit later, because he'll be tuning in from the East Coast, which is the most terrible timezone difference we could have for the Americas; it is five in the morning right now for him, and I told him six is fine. Then we have Mike. Mike is from the University of Manchester and will be talking about scientific insights and pitfalls. Then we have fellow organizer Gemma Turon from Ersilia, who will be talking about making machine learning work in experimental pipelines, that's the word. And then of course Valerio Maggio at Anaconda; he's a dev rel, and he'll talk about my favorite topic, machine learning evaluation. And yeah, of course myself: I will give a broad overview of why we want to make things reproducible. And this is the broad schedule. This is the opening; I should probably be keeping time. We start with myself, then Valerio takes over with some practical insights, then we have the first invited talk by Mike. We want to have a break, possibly a chat, just to break the two hours up, because two hours is rough, not just for us but for everyone attending. That's the coffee break, where everyone unfortunately has to provide their own coffee. And then hopefully Goku will join as well.
And Gemma is going to take us out of this workshop, and of course we'll finish with discussion and audience questions. How everyone runs their session is up to each speaker, so it could be totally interactive with Jupyter notebooks, but I think we'll have more talk-style presentations here, because the format kind of lends itself to that. And then we have five minutes for closing, or five minutes of slack for time shifts, and I can run over because I'm the first one. So let's go over to the other slide real quick. Yeah, I'm just sharing my screen, so this should be alright. Let's go. So yeah, of course the PyData Code of Conduct applies to this workshop, and you'll have to find your own emergency exits wherever you are, but I hope nothing happens.