Testing in Machine Learning

What is testing ML and how it's different from testing deterministic code. Why it's important to test ML artifacts (data + models). What testing data and testing models looks like (and I'll provide quick code snippets so people can see what it looks like). Concluding thoughts on how testing relates to monitoring and continual learning.


I'll give a quick background. There we go. Awesome. All right, good to meet everyone. Sorry if it gets a little dark here; the sun isn't up here yet. So today I'll be talking about testing. First, my name is Goku Mohandas and, as was just mentioned, I'm the founder of Made With ML. Really quick about my background: it's in biology and chemical engineering, so I used to be a scientist and then transitioned more to the applied side, and now I do a combination. I actually started my career building a rideshare analytics app when Uber and Lyft first started out, to help the taxi sector compete against them. Then I went on to work at Apple, mostly doing NLP and then building out their initial ML platform to standardize that work. I was there about three years and then transitioned back to health: I left with the head of Apple Health to start a company in the oncology space, and we were there for about two years. After the acquisition of that company, I had encountered different contexts of applying machine learning in production, and I wanted to share what that looked like. That was how Made With ML started. It's a completely free, open-source resource for people to learn how to put machine learning into production, and I cover all the topics from first principles, so it's not catered to just my specific contexts. I've been getting to meet a lot of amazing people who are working in contexts that I'll personally never touch, Jesper being one of them, so it's become a great medium for myself as well.

I'm going to try to keep it short so we can get back on time. I will be talking about testing, and instead of making slides I thought we could just quickly go through the actual content on Made With ML. So let me get that for you. Can everyone see my screen?

Okay, so go to madewithml.com or just search for "testing ML." I'm going to cover the high-level details here, and I'll paste the links at the end as well. I think we're going to have a discussion at the end, so let's push any questions to that, and I'll answer them in the chat as well.

When people talk about testing machine learning systems, there's a lot of overlap with the testing we've had in Software 1.0, which I'll call deterministic software, and there's a lot we can take from that. But there are also a lot of extensions that we can't take directly from that world, mostly because of the other components: Software 1.0 has code, but now we also have these probabilistic artifacts centered around the data and the models themselves. So there are some changes we need to make when we think about how to test these different artifacts. I always start out with an intuition and the different types of

By the way, all the lessons follow a full end-to-end course. But obviously some people just come in to learn about a specific piece of content, so most of the lessons have a standalone repository and notebook that you can explore as well. I think it's better to see everything in the context of the larger project, though; it fills in more of the missing pieces. I'm going to skip through some of the intro stuff here; you can check out the lesson for that. We talk about the different types of tests, which correlate directly to the machine learning world as well, and about how you should be testing: basically setting up the things you want to test, then actually putting them into the components you're testing, and then asserting that the outputs are what you expect. And there are so many different tools out there. These are just for Python, because we're using Python in the course, but I think every major language has testing utilities.

I'm going to quickly gloss over the best practices; we're actually going to do each of these as I talk about the code, data and models. But there are, I feel, industry best practices when it comes to how you should set up your testing. Start by testing the smallest unit of code first: it's really good to create small functions with single responsibilities and test those individual pieces, but also test as you start to combine different functions and classes together, so you're testing at all the different layers. Then there's how to keep your tests up to date, coverage, and things like that. So we'll look at quick examples. First, obviously, testing the code itself: even though this is machine learning with probabilistic components, those probabilistic components come out of deterministic code that we're writing. We use pytest for the course here.
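The set up, run, assert pattern described above can be sketched as follows. This is a minimal illustration, not code from the course: the `f1_score` function is a hypothetical hand-written metric, and the test functions follow pytest's `test_` naming convention so that pytest would collect them, although they are plain Python and can also be called directly.

```python
def f1_score(y_true, y_pred, positive=1):
    """Hypothetical custom metric: F1 score for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def test_f1_score():
    # Arrange: set up the inputs.
    y_true = [1, 1, 0, 0]
    y_pred = [1, 0, 1, 0]
    # Act: run the component under test.
    score = f1_score(y_true, y_pred)
    # Assert: tp=1, fp=1, fn=1, so precision = recall = 0.5 and F1 = 0.5.
    assert abs(score - 0.5) < 1e-9


def test_f1_score_without_true_positives():
    # Edge case: the metric should return 0.0 instead of raising.
    assert f1_score([1, 1], [0, 0]) == 0.0
```

Keeping the metric a small function with a single responsibility is what makes the test this short.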
But they all start with the same major principle: having assertions about how your piece of code should run. I always start off with simple examples, and then in the actual course we show what it looks like in the context of machine learning. So let's say you have a function for calculating certain metrics. You probably want to assert that the custom function you wrote actually behaves the way you expect. If you're using something like scikit-learn to calculate the F1 score, for example, those functions have all been tested on their end, so you can reliably use them. When you're writing your own functions, say for custom metrics or anything custom, you're going to want to do the same, so that, one, you can trust it, and two, you're hopefully building something reusable for other people on your team to build other applications on top of. If you can have a central repo, if you will, of tested components, people can quickly build their applications knowing they can trust the pieces of code that are written, at different granularities.

I'm going to skip through the code part pretty quickly so we can get to the data and models. But you can similarly test not just functions but classes as well, and there are very efficient ways to set up initial values for the classes so you aren't setting them over and over. The big component I wanted to cover for testing code, which extends to testing the ML artifacts as well, is parametrization. You don't want to keep writing the testing logic over and over. You just want to specify a bunch of inputs and expected outputs, write the logic once, and it'll iterate over the different inputs and outputs you want to test.
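To make the parametrization idea concrete, here is a small table-driven sketch. The `preprocess` function and its cases are hypothetical, not the course's actual preprocessing. The loop is a dependency-free stand-in for what pytest's `@pytest.mark.parametrize` decorator does: the test logic is written once and run over every (input, expected output) pair.

```python
import re


def preprocess(text):
    """Hypothetical preprocessing: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


# (input, expected output) pairs; adding a test case means adding a row.
CASES = [
    ("Hello, World!", "hello world"),
    ("  Transformers   for NLP ", "transformers for nlp"),
    ("", ""),
    ("F1-score: 0.9", "f1 score 0 9"),
]


def test_preprocess():
    # With pytest this loop would usually become
    # @pytest.mark.parametrize("text, expected", CASES) on a two-argument test,
    # so that each row is reported as its own test case.
    for text, expected in CASES:
        assert preprocess(text) == expected, (text, expected)
```

The benefit of the decorator over the loop is mainly reporting: a failing row shows up as one failed case rather than stopping the whole test at the first mismatch.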
So here's the example for the machine learning use case. Let's say you have a preprocessing function. I don't want to keep writing "this input, preprocessed, equals this output." I'll just write down a bunch of inputs and outputs and let the test parametrize over them. There are obviously a lot of smaller details in how to do this really well.

There's a concept called coverage, which I think we should definitely talk about. People always strive for 100% coverage, and that doesn't always mean testing every single line in your repository. It just means that you've accounted for every single line, and there are some lines that just don't make sense to test. When you run, say, a coverage report on top of pytest, you don't have to have tested every piece, but you should know what you're not explicitly testing. Usually those lines are around setting variables and things like that. So there are ways you can have 100% coverage while knowing that you've covered all the important bits and excluded the parts that don't necessarily need to be tested, or that you want to cover later.

Okay, so let's get into the meat of it now. After code, I want to quickly talk about testing the data and the models themselves. The best teams I've seen in this space used to have custom scripts for testing their data artifacts. Today there are a bunch of awesome libraries. There's Great Expectations, which is for testing data itself, not specifically for the machine learning context but for any kind of data pipeline; usually this happens in the ELT stack, actually. And there are other, more ML-specific libraries: there's Soda, there's Deepchecks, and a few others as well. But I use Great Expectations; I've been using it for almost four years now.
And it has naturally grown into satisfying a lot of the requirements for machine learning use cases as well. The great thing about using a library like this, rather than writing your own scripts, is that you basically just take your dataset and encapsulate it using the Great Expectations abstraction, and with this you get a lot of out-of-the-box expectations. With code, we designed the expectations ourselves, but with data, regardless of the modality or the dimensionality, there are a lot of out-of-the-box expectations you can have. So let's look at a few of them, because even in our dataset here you have some categorical variables, you have some unstructured data, you can even have, say, floating points, etc.

The notion here is called expectations, similar to assertions, and there are really a lot of out-of-the-box ones, and they're growing as well. A few simple ones for our dataset: you want to make sure, for example, that the columns you expect in your dataset are actually there. You may want to check that there are no data leaks. I think I've heard every speaker talk about this already, so it's a very important one. A data leak check here could be as simple as making sure that every combination of a title and description is unique. You should probably apply this across all your data splits as well, to make sure the same sample isn't accidentally inserted into different splits. Then there are missing values, unique values, making sure that certain features are of a certain type, and making sure, if you have a categorical variable for example, that its values come from a predefined list, so that there aren't any new classes. All of these are very contextual.
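The kinds of expectations listed above can be sketched in plain Python. To be clear about assumptions: these helpers are hand-rolled stand-ins written for illustration, not the Great Expectations API, and the rows, column names and allowed tags are made up. A library provides equivalent checks out of the box, which is the point of the passage above.

```python
# Hypothetical dataset: a list of rows with title, description and tag columns.
ROWS = [
    {"title": "Intro to CNNs", "description": "Image models", "tag": "computer-vision"},
    {"title": "BERT explained", "description": "Language models", "tag": "nlp"},
    {"title": "Graph nets", "description": "Learning on graphs", "tag": "graph-learning"},
]
ALLOWED_TAGS = {"computer-vision", "nlp", "graph-learning"}


def check_columns_present(rows, columns):
    """Table-wide: every expected column exists in every row."""
    return all(set(columns) <= set(row) for row in rows)


def check_unique_combination(rows, columns):
    """Table-wide: no two rows share the same values for `columns`."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))


def check_no_missing(rows, column):
    """Column-wide: no None or empty-string values."""
    return all(row.get(column) not in (None, "") for row in rows)


def check_values_in_set(rows, column, allowed):
    """Column-wide: a categorical column only uses predefined classes."""
    return all(row[column] in allowed for row in rows)


def check_no_overlap(split_a, split_b, columns):
    """Leak check across splits: no sample appears in both."""
    keys_a = {tuple(row[c] for c in columns) for row in split_a}
    keys_b = {tuple(row[c] for c in columns) for row in split_b}
    return not (keys_a & keys_b)


def test_dataset():
    assert check_columns_present(ROWS, ["title", "description", "tag"])
    assert check_unique_combination(ROWS, ["title", "description"])
    assert check_no_missing(ROWS, "title")
    assert check_values_in_set(ROWS, "tag", ALLOWED_TAGS)
    assert check_no_overlap(ROWS[:2], ROWS[2:], ["title", "description"])
```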
The Great Expectations website lists a lot of out-of-the-box tests you can use, but they also make it very easy to write your own custom tests. That becomes really useful because machine learning is completely contextual, and you will be writing your own tests as well. They provide the framework so you don't have to make it a custom script; it'll just be two or three lines of Python and you can insert it.

So writing these tests is great, and obviously you can run them. But they need to be organized: you can't just keep running them ad hoc, and you certainly can't expect everyone to run them every time. So the best practice I've seen is to organize them. Usually tests should be split into table-wide expectations and column-wide expectations. There can be other types as well, but these are the two main classes. Once you have that, you want to create something called a suite. This is an abstraction that they have, but an easier way of thinking about it is: I have a dataset and I have a set of expectations for it, and that collection of expectations will grow over time. You can call that a suite, and then you apply that suite to this dataset. But you may also apply it to other datasets that share the same features, so it's not always one to one. A collection of suites makes a project: for this data science project, these are all the different collections of tests that I have, and those will grow over time. And a tool like Great Expectations makes it really easy to connect to different data sources.
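As a rough mental model of the suite idea described above, the sketch below groups named expectations, each marked table-wide or column-wide, into a collection that can be applied to more than one dataset. The `Expectation` and `Suite` classes are hypothetical illustrations and do not mirror how Great Expectations implements its suites.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]


@dataclass
class Expectation:
    name: str
    scope: str                      # "table" or "column"
    check: Callable[[Rows], bool]


@dataclass
class Suite:
    """A named collection of expectations that grows over time."""
    name: str
    expectations: List[Expectation] = field(default_factory=list)

    def add(self, name, scope, check):
        self.expectations.append(Expectation(name, scope, check))

    def run(self, rows):
        """Return {expectation name: passed} for one dataset."""
        return {e.name: bool(e.check(rows)) for e in self.expectations}


suite = Suite("projects")
suite.add("has_title_and_tag", "table",
          lambda rows: all({"title", "tag"} <= set(r) for r in rows))
suite.add("title_not_missing", "column",
          lambda rows: all(r.get("title") for r in rows))

# The same suite applies to any dataset that shares these features.
train = [{"title": "A", "tag": "nlp"}, {"title": "B", "tag": "mlops"}]
new_batch = [{"title": "", "tag": "nlp"}]

train_report = suite.run(train)
batch_report = suite.run(new_batch)
```

Here `train_report` passes both expectations, while `batch_report` flags the empty title. A project, in this picture, would simply be a collection of such suites.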
Maybe in the beginning it's a single file like a CSV, but eventually you're connecting directly to the database or the data warehouse. You can set all these connections up and have these suites execute every time you make a change to the code or every time there's a new version of the data coming in. You can automate these to run in your pipelines as well. So it's just a very powerful tool, and I think a lot of the tools around testing today enable this kind of orchestration.

The last thing I'll mention about this: in the course, testing is something that you do regularly, but it's hard to teach something that you do cyclically in a linear fashion. I do talk about what this looks like in production, though. After v1, maybe you're testing on the dataset that you have right now, but that's not where most teams actually have their tests. The most mature teams put their tests at a much earlier point. By the way, with Great Expectations there's also a lot of documentation that's generated as well.

So in production, this is actually what it looks like. Instead of testing in your specific machine learning repository, you're going to want to put the bulk of the testing way upstream, because your one machine learning application is not going to be the only consumer of that data. So it makes sense to have a lot of the validation happen way upstream, for example right after you extract the data from the actual source. Maybe you have some tests that need to pass after it's ingested into a warehouse using a tool like Airbyte or Fivetran, and maybe another couple of tests are executed after you apply some transformations. So you have tests applied after each stage in the ELT stack, way upstream.
And now my machine learning application and your machine learning application, which maybe share the same data source, can both benefit from these tests that have already run for us. Since they run way upstream, you can just look at the reports. Obviously our two applications may have additional tests as well, and those you can run in your specific repository if they don't make sense for everyone or you need certain things to look a certain way. In general, every time there's a movement or transformation of data, it's a good idea to place a test there, because some things are just not in your control and, unfortunately, not everything is fully communicated. This is a great way to catch issues before they surface much further downstream. Okay, I think we just have a few more minutes. Real quick about models.
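The idea of placing a test after every movement or transformation can be sketched with a toy pipeline. Everything here is an assumption made for illustration: the `extract`, `load` and `transform` functions are in-memory stand-ins for a real source, warehouse and transformation step, and in practice the checks would be expectation suites run by an orchestrator.

```python
def extract():
    # Stand-in for pulling raw records from a source system.
    return [{"id": "1", "price": "10.5"}, {"id": "2", "price": "3"}]


def load(records):
    # Stand-in for ingestion into a warehouse table.
    return list(records)


def transform(records):
    # Cast the raw strings to typed values.
    return [{"id": int(r["id"]), "price": float(r["price"])} for r in records]


def validate(stage, records, checks):
    """Run every check for a stage and fail fast, naming the stage."""
    for name, check in checks:
        if not check(records):
            raise ValueError(f"{stage}: expectation failed: {name}")
    return records


def run_pipeline():
    raw = validate("extract", extract(), [
        ("not_empty", lambda rs: len(rs) > 0),
        ("has_id_and_price", lambda rs: all({"id", "price"} <= set(r) for r in rs)),
    ])
    loaded = validate("load", load(raw), [
        ("row_count_preserved", lambda rs: len(rs) == len(raw)),
    ])
    return validate("transform", transform(loaded), [
        ("price_non_negative", lambda rs: all(r["price"] >= 0 for r in rs)),
        ("ids_unique", lambda rs: len({r["id"] for r in rs}) == len(rs)),
    ])
```

Because each failure names its stage, a downstream consumer can tell whether a problem came from the source, the ingestion or the transformation.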

Okay, awesome. With models, I split it into three different categories: what does testing look like for training, then for the actual model itself, and then for inference?

For training, the actual process of training, maybe you want to check things like the shapes and values of the intermediate and final outputs from your model. You want to check, perhaps on a single batch, that the loss is actually decreasing across iterations. This next one is a really good one that I think a lot of people don't do, but it's very easy to do: just overfit on a batch. Any kind of model that you're developing, you should be able to overfit. Performance on the test set may not be great, but that's okay; you're just testing that the logic works in terms of producing the loss and the pattern of the loss. Then train to completion: actually run through and make sure that things like early stopping and the saving mechanisms all work. And this is a big one: make sure it works on different devices. Maybe you're doing something small-scale locally, but your actual production will run on GPUs or TPUs, so run the tests on those. There are amazing tools coming out, this year and in the next couple of years as well, that make this switch of context very easy, so it's not a completely different style of work to write scripts on the cloud or run the same tests. It's becoming sort of like the infinite laptop, if you will. So there's a lot of amazing work happening in the testing space. And again, in the course we apply a lot of the concepts from the code section to the modeling here as well, to make it more streamlined.

There's also the concept of actually testing your model itself. This domain is huge, and it depends on your model and your specific application. Our task is NLP here.
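The "overfit on a batch" and "loss is decreasing" checks above can be sketched without any framework. This is a toy under stated assumptions: the model is a hand-written logistic regression trained with full-batch gradient descent, and the four-sample batch, learning rate and thresholds are made up for illustration. With a real model the same test would wrap the actual training step.

```python
import math
import random


def predict_proba(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def batch_loss(weights, bias, X, y):
    """Mean binary cross-entropy over the batch."""
    eps = 1e-12
    total = 0.0
    for x, t in zip(X, y):
        p = predict_proba(weights, bias, x)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(X)


def train_step(weights, bias, X, y, lr):
    """One full-batch gradient descent step."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, t in zip(X, y):
        err = predict_proba(weights, bias, x) - t
        for j, xj in enumerate(x):
            grad_w[j] += err * xj / len(X)
        grad_b += err / len(X)
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b


def overfit_batch(X, y, steps=500, lr=1.0, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in X[0]]
    bias = 0.0
    losses = [batch_loss(weights, bias, X, y)]
    for _ in range(steps):
        weights, bias = train_step(weights, bias, X, y, lr)
        losses.append(batch_loss(weights, bias, X, y))
    return weights, bias, losses


# Hypothetical single batch: the label equals the first feature.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]


def test_overfit_single_batch():
    weights, bias, losses = overfit_batch(X, y)
    probs = [predict_proba(weights, bias, x) for x in X]
    assert len(probs) == len(y)                  # output shape
    assert all(0.0 <= p <= 1.0 for p in probs)   # output values
    assert losses[-1] < losses[0]                # loss went down
    assert losses[-1] < 0.1                      # and the batch was memorized
    assert [int(p > 0.5) for p in probs] == y
```

If a test like this fails, the problem is usually in the training logic (loss, gradients, update) rather than in the data or the model capacity.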
So I'm sure a lot of you have heard about behavioral testing for NLP models. There are three big pillars here. Take invariance: basically, certain types of changes that I make shouldn't affect the output. For example, if I change these words here, in "Transformers applied to NLP have [blank] the ML field," then for my task these changes shouldn't change the output. And I say "for my task" because if your task is something else where these should change the output, then this is not invariance. These are just different types of tests that I write for the model itself. And notice that it doesn't matter what the model is: it could be deep learning, it could be rule-based, it could be anything. These types of tests are agnostic to the actual model, and they should always be passing, so they're almost like sanity checks that you want to assert. Again, these can be part of your actual testing suite, so that they're all run every time I make an update and aren't things I'm running manually. Obviously there's a lot of adversarial testing you can do as well. And once you have these kinds of tests, again, we parametrize them so you can run them pretty easily.

Once you're done with the training and the actual model, the last big section is around inference. You should test that you're actually able to load the different artifacts you've created. You should test that you can run simple predictions and, if it's a REST service, say, that the request actually goes through. We use a Makefile to orchestrate all this, but later in the course we use, for example, Airflow to make sure these things actually run, and GitHub CI/CD to enforce that they run without us manually executing the tests. Okay, so that's a quick whirlwind tour of testing.
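An invariance test of the kind described above might look like the sketch below. Since the test is agnostic to the model, a toy rule-based classifier stands in for the trained one. The keyword lists, the `predict_tag` function and the specific verbs swapped into the sentence are all assumptions made for illustration.

```python
# Toy rule-based stand-in for a trained tag classifier (hypothetical keywords).
KEYWORDS = {
    "nlp": {"nlp", "transformers", "text", "language"},
    "computer-vision": {"image", "images", "cnn", "vision"},
}


def predict_tag(text):
    """Pick the tag with the most keyword hits, or "other" if none match."""
    tokens = set(text.lower().replace(".", "").split())
    scores = {tag: len(tokens & words) for tag, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


TEMPLATE = "Transformers applied to NLP have {verb} the ML field."
# Assumed fill-ins for the blank; for this task none should change the tag.
INVARIANT_VERBS = ["revolutionized", "disrupted", "changed"]


def test_invariance_to_verb_swap():
    predictions = {predict_tag(TEMPLATE.format(verb=v)) for v in INVARIANT_VERBS}
    # Every variant must map to the same, expected label.
    assert predictions == {"nlp"}
```

Swapping `predict_tag` for a call into a real model leaves the test unchanged, which is what makes these checks usable as sanity tests on every update.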
Obviously, there are a lot more ties from testing to monitoring, etc. I'm happy to answer questions, and I'll paste the links right now for all of this. But definitely check out the course as well. It's all free and open source, and I keep it up to date as I work with a lot of companies in different industries and at different scales, in a way that speaks to all the contexts. And everything is in code as well, because I think it's really important to implement these things so you can see what they look like in practice. I'll stop the share now, share the resources, and start answering questions.