Why and how to make ML reproducible?

Abstract

This overview talk sets the scene and presents different areas in which researchers can increase the quality of research artefacts that use ML. These quality improvements build on existing solutions, so that the impact on researcher productivity stays minimal.

This talk loosely covers the topics Jesper discussed in their EuroSciPy tutorial, which will also be used for the interactive session here:

https://github.com/JesperDramsch/ml-for-science-reproducibility-tutorial

Topics covered:

  1. Why make it reproducible?
  2. Model Evaluation
  3. Benchmarking
  4. Model Sharing
  5. Testing ML Code
  6. Interpretability
  7. Ablation Studies

These topics serve as examples of “easy wins” that researchers can implement to disproportionately improve the quality of their research output with minimal additional work, using existing libraries and reusable code snippets.

Transcript

Why and how do we make machine learning reproducible, especially for scientists? Well, we can make a huge impact with machine learning. Essentially, we have the ability to make things faster and do things that were never possible before. Right now everyone is wild about Stable Diffusion, where we can write a sentence and suddenly images are generated. And of course, every scientific discipline has machine learning workshops at the moment that say: you can do this, and you can do it faster, and these revolutionary insights are possible now, like the protein folding that has been done. So we can really do things that were never possible before, which gives us, well, great power but also great responsibility.

And of course we're doing science, so these are scientific results, and in theory scientific results should be reproducible. The reproducibility crisis has been ongoing since before machine learning, but machine learning really has the power to exacerbate it and make it worse, and we don't want that. But if "we're scientists, so we should do it" were the only reason, then we wouldn't have this workshop, or a lot of other people talking about this issue. So what are the other reasons to make our work reproducible?

Well, human progress. If we build our model and do this cool thing, and people can take this model and build upon it, we are now progressing much faster, because they don't have to figure out how to do it from an abstract paper. I think everyone here who has been working in science or machine learning has seen those papers that are just a vague description, which often is just a nice narrative around how this model could have been built. You have no access to the data, no access to the weights, no access to the training scripts, and that's a nightmare to reproduce.

And of course, ethical work. We have to check that our work is ethical and doesn't negatively impact anyone involved. Reproducibility really helps there, because it's adjacent to open-source work, where we can really have a look at what the training data is, how these models were conceived, and what design choices were made in them.

And of course, funding bodies will want this more and more, especially if you have a black-box model and you just say: oh yeah, I found this scientific insight, and that's it. A lot of funding bodies aren't happy with that anymore, because there's a divide between the engineering and the progress, and most funding bodies want to fund progress in science, not just this one thing that you can then build an API around. So that requirement is coming up more and more.

And of course, it's easier to reuse: other people can use your work much more easily if you make it reproducible. This can build companies, this can build more scientific impact, like I said before with human progress, but it's also a gift to ourselves. If you've ever come back to your own code after just a year and looked at it, well, a lot of people despair. I have despaired, before I got better at these practices, at building better models and writing better code to make it reproducible. So machine learning and bad science unfortunately hurt people.
So we have to work to make these models as transparent as possible, and make them reproducible, so people can check the work but also use the work and build on it. That's big words, though, and I myself have struggled with this a lot: how do you actually do this?

Well, we can start with model evaluation. I won't go too deep into this, so as not to step on Valeria's toes, but essentially we have to ensure we have valid results. If we don't do proper evaluation, for example if we just train on the entire dataset, overfit the model to that dataset, and write a paper based on it, we now have a paper that hopefully doesn't get accepted, because reviewers should reject it. This paper has obviously taken a lot of work, so if it gets rejected that's sad for us, because we put in all that work. But if it got accepted, that is now a scientific result claiming "yes, we can do machine learning on this" for something that actually wasn't possible, because the overfit model doesn't generalize to new data. It only works on this dataset, this subset of the real world that we're looking at. So we have to reject this to be able to work with actual insights.

And of course, we can reduce the impact of dependence in the data with model evaluation. I have a nice image of a satellite here: if you use satellite data, it's dependent spatially, because we essentially have maps, and it's dependent in time as well. We have to account for that in model evaluation, and of course address the real-world class distribution; rarely do we actually have balanced classes in our data. But that's enough about that. How do we do it? Well, Valeria will talk about it, and I wrote a mini ebook that you can get on my website. It's free if you sign up to my newsletter, and essentially I go through all of this in about one page per topic. And of course we have our GitHub repo where all of this is documented with code; Valeria has put in a lot of work to make these Jupyter notebooks really high impact.
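To make the dependence point concrete, here is a minimal sketch (not taken from the tutorial notebook) of group-aware cross-validation in scikit-learn, so that samples from the same spatial tile never appear in both the training and validation folds. The data, group ids, and model choice below are placeholder assumptions:

```python
# Minimal sketch: group-aware cross-validation for spatially dependent data.
# X, y and the group ids are random placeholders standing in for real data,
# e.g. one group id per satellite tile or per acquisition date.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))           # features (placeholder)
y = rng.integers(0, 2, size=1000)        # binary labels (placeholder)
groups = rng.integers(0, 20, size=1000)  # spatial/temporal group per sample

model = RandomForestClassifier(random_state=42)
cv = GroupKFold(n_splits=5)  # whole groups stay on one side of each split
scores = cross_val_score(model, X, y, groups=groups, cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
```

The same idea applies in time: swapping GroupKFold for scikit-learn's TimeSeriesSplit keeps the validation data strictly after the training data.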

The next one is benchmarking, which is really important for collaboration. We can use dummy classifiers as a baseline and use domain solutions: in weather, for example, we have WeatherBench, where we have specific solutions and can compare them against each other. That is part of reproducibility, where we can actually compare results on a shared dataset, where we define the metrics and really say: okay, your new insight should be able to work on this. Benchmark datasets like that. And of course, simple models are really important: compare your model to something that isn't a deep neural network. We have this paper in seismology where a huge deep neural network was published in Nature to predict something about earthquakes, I think the aftershocks, but the problem was that you can do the same with a linear regression if you just pick the important features. So we really have to use benchmarking to be able to compare that. Domain solutions also means using solutions that already exist in the domain. Right now we're working on a paper where we have a really cool transformer doing a thing in weather forecasting, and of course we have to compare that to existing solutions: there's a member-by-member solution that basically shifts all the weather forecasts individually, and that is a very good, very advanced approach. If we compare our machine learning model to that, then we can actually say: alright, we're doing some good work here. So not just the super simple models, but also domain solutions that work. And of course, random forests. Random forests are my usual baseline model, because they're nonlinear and usually a little bit better than linear models. But if a linear model works on your dataset, you certainly don't need a fancy, big model.

So, model sharing. Really important, and this is also part of later talks as far as I know. But yeah: you have to export your model, share model checkpoints, and fix the sources of randomness. This is really important. PyTorch, TensorFlow, and scikit-learn all have a specific page about fixing all the sources where stochasticity comes in. It can be a little bit tricky, because you have to fix it for PyTorch and then for NumPy, and maybe even for Python, depending on which version you're on. So going back to the documentation for this is really important.
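As a rough sketch of what fixing the sources of randomness can look like in practice (assuming PyTorch and NumPy are the libraries in play; the libraries' own reproducibility pages remain the authoritative reference):

```python
# Sketch of pinning the usual sources of randomness before training.
# CUDA and some ops can still be nondeterministic; see the PyTorch,
# TensorFlow and scikit-learn reproducibility docs for the full story.
import os
import random

import numpy as np
import torch

SEED = 42

os.environ["PYTHONHASHSEED"] = str(SEED)  # only fully effective if set before Python starts
random.seed(SEED)                         # Python's built-in RNG
np.random.seed(SEED)                      # NumPy (scikit-learn also takes random_state=SEED)
torch.manual_seed(SEED)                   # PyTorch CPU and CUDA generators
torch.use_deterministic_algorithms(True)  # raise instead of silently using nondeterministic ops
```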
Then linting and formatting: use automatic linters and formatters. I'm a fan of black. I know some people don't like it, and there are also forks, like blue, that take some liberties with the black style. So you can make your code look nice and standardized according to, well, the Python standard, and that really helps when sharing your code with others. Even if your code isn't the best written, because we're all scientists or researchers, so our code isn't up to the standard of a computer scientist who has been writing C or whatever their entire life, we can at least make it easy for our colleagues to read the code and work through it.

And docstrings are a really nice way to document what you're actually doing, and you can build documentation out of docstrings, so use those. If you work in VS Code, which I love, you can install an extension that goes through your function signature and writes most of the docstring for you if you have type hints, so you only have to write the description of the actual inputs and outputs. That makes life really easy. I worked with a developer who came from Java and hated writing docstrings, but when I showed him that extension, and another one that uses a machine learning model to infer what the variables might be, docstring writing became really easy. And your colleagues will love you for it, because even if they use Jupyter notebooks, they can just Shift-Tab into your function and see what it's doing.

And of course, fix your dependencies. This was actually funny: when Valeria did the first pull request on what I had written, he fixed all my requirements files, because I was lazy and just had the package names in there. Now we actually have versions in there, which is a much better way to do it. So at least start with sharing requirements files or an environment YAML, but if you can pin the versions, it's even better, especially in machine learning. I have seen research where all sources of randomness were fixed, and they just ran different versions of the same gradient-boosting library and got different results, purely because the version differed. So fixing the dependencies is really important for model sharing as well. And of course, the gold standard is Docker, which can be a little frightening, but we have an example Dockerfile in our GitHub repo, so you can check it out and see that it doesn't have to be that hard, to be honest. And there are also people trying to make this kind of sharing even easier.

And then of course we want tests, but we have a whole talk on that, so I'll just go through it quickly: fix all your inputs, check that your methods work deterministically, and take some data samples for tests. Write docstrings, because you can put examples in docstrings and then run them as doctests, which is really neat; that's the easiest way to write your first test. And of course, validate your inputs.
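As a sketch of the doctest idea, here is a hypothetical function whose docstring example doubles as its first test, combined with a bit of input validation; the function name and values are made up for illustration:

```python
# Hypothetical example: the docstring example below is executable and
# becomes the first test via `python -m doctest this_file.py -v`.
def normalise(values: list[float]) -> list[float]:
    """Scale values linearly to the range [0, 1].

    >>> normalise([0.0, 5.0, 10.0])
    [0.0, 0.5, 1.0]
    """
    if not values:
        raise ValueError("values must not be empty")     # validate inputs
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not be constant")  # avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    import doctest

    doctest.testmod(verbose=True)
```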
And yeah, interpretability is a double-edged sword in a way. There are some interpretability methods that have fallen out of favor a little bit, because they only interpret one sample. They are nice for publications, because you can find the sample where your network looks the nicest and say: oh yeah, this is explainable AI. But it only explains that one sample that has been run through your model. There are tools that are good, though. scikit-learn has some of them, like permutation importance and partial dependence plots, which really show how important the features are. That already is an interpretability tool, and I've used it a lot in my work with domain scientists, really trying to build something together and show them how the model interprets their data. That is a good sanity check, to see whether the model puts importance on the features that domain expertise says are important. Then you can use SHAP, but I have a love-hate relationship with SHAP at the moment. If you were on the Discord after Ian Ozsvald's talk, I said it's low-key unusable right now. There are a lot of problems with its dependencies, and there is only one developer on it at the moment. If you can make it work, it's amazing, but right now it doesn't work out of the box, which is sad, because it's such an amazing tool and it used to be so useful. When that comes back, definitely use it. Use model inspection itself, too, and see what is going on inside your models.

And then communicate results. Make sure you build good rapport with domain scientists, so they can actually understand what you're doing. No one is actually interested in your accuracy number, because you can also use it to cheat; 99% accuracy doesn't mean anything if you don't know the data. So be sure to build trust, essentially: you want to build that intuition about the model and about the data, and build a good rapport with domain scientists.

And then the last one, which I am quite fond of, is ablation studies, especially if you publish your models. We all know that a lot of data science is very iterative: you try this, it doesn't work, then you try that, it doesn't work. So we follow this optimization path, but we don't know if it's the optimum, if it's the best model overall. So what we do is remove components from our solution. In the notebook I've been cheating a little bit, because we have such a toy model that I didn't build a deep neural network where you can switch off, say, the probabilistic part or some other part. So I just switched off the standardization to show that the standardization was really important. It's a mini ablation study, and it's a bit of a cheat, but it shows an example of switching off different components in your machine learning pipeline, in your workflow, so you can actually argue that a component was important to include and that it's not spurious in the end. And that shows the impact of your final pipeline.

So where do you find all of this? Well, on our website, where you can get to the GitHub repo as well. It should also be linked on the session page by our moderator, whose name I never got. Sorry about that.
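To close with a concrete picture of the mini ablation described above (a sketch in the same spirit, not the notebook code; the dataset and models are placeholders), the idea is simply to switch one pipeline component off at a time and compare the scores:

```python
# Mini ablation sketch: drop one component (here the standardisation step)
# and compare cross-validated scores to show that it actually matters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

variants = {
    "full pipeline": Pipeline([("scale", StandardScaler()), ("clf", SVC())]),
    "no scaling": Pipeline([("clf", SVC())]),
}

for name, pipeline in variants.items():
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If a component can be removed without hurting the scores, that is worth reporting too, because it tells readers which parts of the pipeline actually carry the result.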