Evaluating Machine Learning Models

In this talk, we will introduce the main features of a Machine Learning (ML) experiment. In the first part, we will dive into the benefits and pitfalls of common evaluation metrics (e.g. accuracy vs. F1 score), whilst the second part will mainly focus on designing reproducible and (statistically) robust evaluation pipelines.

The main lessons learnt and takeaway messages from the talk will be showcased in an interactive tutorial, with applications in biomedicine.



That was a great introduction, with lots of things to follow on. Please let me know if you can't hear me well. I'm going to share my screen in the meantime, and I will also try to monitor the chat. If there's any burning question, please feel free to raise your hand. No worries at all. Right. I'm going to time myself, because timing is always of the essence here.

So this part is about evaluating machine learning models. In this presentation I would like to give you a little introduction to the basic components you need to evaluate your machine learning models. Starting from the very beginning, let's put some terminology in place so that we understand each other. These are the basics, if you're familiar with machine learning. Essentially, we have some domain objects from which we extract some features, which are the relevant information we use in the model. We also call them the data. We have some task to solve, and the output is the prediction generated by the model. The model is essentially an instance of a learning algorithm, and the whole combination of learning algorithm and model is the learning problem we are trying to solve. Depending on how the training is performed, you can have supervised learning or unsupervised learning. For this talk we will focus on supervised learning, without any loss of generality.

So, I have essentially two main objectives with this talk. The first one is to give you a description of the basic components that are required to carry out a machine learning experiment, and we will understand in a second what I mean by that.
What I would like to provide is an overview of the basic components, not just recipes, so that we understand the very basics. The second goal is to give you some appreciation of the importance of choosing measurements in the appropriate way. In other words, just measuring accuracy, which seems to be the most used metric, might not be a good idea; it might not be the right measure to use. It's not always the case, and we'll try to understand what the point is.

In a machine learning experiment we have to deal with different elements. We have the research question, which is what we are trying to achieve. We have the learning algorithm. (I tried to play a little bit with the colors, so let me know if that's not clear at all, and feel free to interrupt me.) So we have the learning algorithms and the instances of those algorithms, so M is going to be the models, and we have some data, datasets in general. Classical examples of research questions might be: how does model M perform on data from domain D? Or another question could be: how would model M perform on data from a different domain? That's a completely different question to ask. Or, for example: which of these models, so different instances of the same algorithm, has the best performance on data from D? And another kind of question might be: which of these learning algorithms gives the best model on data from D? All of these are pretty standard questions we end up asking ourselves when we set up our machine learning experiments. If you're wondering about the difference between the last two, one is essentially: how do I play with the hyperparameters in order to understand which is the best instance of the learning algorithm I want to use on my data?
And the last question, on the other hand, is: which is the best model on my data? There's an explanation for that in the very last slides, and of the reason why we want to do that.

In order to set up our experimental framework, we need to investigate essentially two main questions. There are actually three, but in this talk we're going to go for the main two. The first is what we need to measure, and the second is how we measure it. The last question is how we interpret the results. That would be the next step, and we don't have the time to cover it. But in essence, the last question is about how robust and reliable your results are, so that you can trust them, even from a statistical point of view.

So let's start with what to measure. I want to start with the very basics, where everything starts, which is the so-called confusion matrix. For those of you not familiar with it, let me introduce its basic components. These slides are going to be a little bit dense, so bear with me, and let me know if that's not clear at all. For the confusion matrix we consider a binary classification problem, here and for the rest of the slides, with no loss of generality but just to make things easier to understand, in my opinion. In clockwise order, this matrix is composed of four different components. We have two classes, positive and negative in general. We have the true positives (TP), which are the positive samples correctly predicted as positive. We have the false negatives (FN), which are the samples that have been wrongly predicted as negative; they should have been positive. And the sum of those is the condition positive, P.
So that's the total number of real positive samples, the total number of samples we have in the positive class. The true negatives (TN) are the negative samples correctly predicted as negative, and the false positives (FP) are the negative samples wrongly predicted as positive. The sum of those is the condition negative, N, or the total number of negative samples we have in our dataset. The sum of everything in this matrix is T, so P plus N is the total number of samples in our dataset. These are the four basic components on which lots of different metrics are built, and from which they are essentially derived. First of all, we have the proportion of positives, pos, which is the ratio between P and T, and the proportion of negatives, which is one minus pos, or N over T.

Now things start to get interesting, in my opinion. We have the main primary metrics, and all the other metrics we're going to talk about are essentially derived from these. We have the true positive rate, also known as sensitivity, also known as recall. For some reason, in the literature, people over the years tend to call the same thing by different names. Recall or sensitivity: it depends on the context we're talking about. If it's, for example, machine learning or information retrieval, you might have different terminology, but they do refer to the same thing. It's the rate of true positive cases, so TP over P. (I tried to highlight in the matrix the squares we are considering every time.) The true negative rate is essentially TN over N. And the confidence, also known as precision in information retrieval, is the ratio of the true positives over the true positives plus the false positives.

Now we get to the secondary metrics. We're not going to cover all of them, just two or three.
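
As a minimal sketch of the confusion-matrix components and primary metrics described above, the snippet below computes them from the four counts. The class name `ConfusionMatrix` and the example counts are made up for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Binary confusion matrix (hypothetical helper, names are illustrative)."""
    tp: int  # positives correctly predicted as positive
    fn: int  # positives wrongly predicted as negative
    tn: int  # negatives correctly predicted as negative
    fp: int  # negatives wrongly predicted as positive

    @property
    def p(self) -> int:
        # Condition positive: all real positives.
        return self.tp + self.fn

    @property
    def n(self) -> int:
        # Condition negative: all real negatives.
        return self.tn + self.fp

    @property
    def t(self) -> int:
        # Total number of samples.
        return self.p + self.n

    def primary_metrics(self) -> dict:
        return {
            "pos": self.p / self.t,                      # proportion of positives
            "neg": self.n / self.t,                      # proportion of negatives
            "tpr": self.tp / self.p,                     # recall / sensitivity
            "tnr": self.tn / self.n,                     # true negative rate
            "precision": self.tp / (self.tp + self.fp),  # confidence
        }


# Made-up counts, only to exercise the definitions.
cm = ConfusionMatrix(tp=60, fn=20, tn=15, fp=5)
metrics = cm.primary_metrics()
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Keeping the four raw counts around, rather than a single score, is what lets any secondary metric be derived later.
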
And we're going to see the difference between them, because one of the main points of this part is: are those metrics really the same? Does it really not matter which metric we choose? Or, better worded: what's the difference between choosing one metric over the other? A popular secondary metric is the F-measure, also known as the F1 score, which is essentially the harmonic mean of precision and recall (also known as the confidence and the sensitivity, respectively). The other super popular metric that lots of people tend to use in evaluation is the accuracy. Looking at the confusion matrix, accuracy is essentially the ratio between the good predictions, so the true positives plus the true negatives, over the sum of positives and negatives.

At this point there's also this one, bear with me another minute. This is the very important metric I want you to use. It is on the less famous end of the spectrum but, in my experience, it is one of the most effective. The Matthews correlation coefficient (MCC) is still a secondary metric but, differently from all the other secondary metrics, it considers all the possible components of the confusion matrix. Let's analyze the components of this formulation. Essentially, in the numerator we have the good predictions minus the bad ones. So we have the good, the bad, and the ugly, where the ugly one is essentially all the possible combinations of good and bad in the denominator. That's the idea of the Matthews correlation coefficient. We're going to come back to it later, but that's the metric I want you to work with.

So let's start by asking: is accuracy a good idea?
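
For reference, the three secondary metrics just described can be written out in terms of the confusion-matrix counts. These are the standard textbook definitions, using the same symbols as the talk (TP, TN, FP, FN, P, N); the slide's own notation is not in the transcript and may differ.

```latex
\begin{align*}
F_1 &= \frac{2 \cdot \mathrm{prec} \cdot \mathrm{rec}}{\mathrm{prec} + \mathrm{rec}}
     = \frac{2\,TP}{2\,TP + FP + FN} \\[4pt]
\mathrm{acc} &= \frac{TP + TN}{P + N} \\[4pt]
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                     {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align*}
```

Note that TN does not appear anywhere in F1, while MCC uses all four cells.
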
So that's the formulation of accuracy at the top of the slide. In general, we have some data for evaluation, which we treat as representative of any future data, right? So imagine we have some data that we want to use to evaluate a model. I haven't told you yet where this data comes from; imagine someone is giving you this data, which should be the case in general. Nonetheless, the model may need to operate in different operating contexts, and different operating contexts might mean, for example, different class distributions. So we treat the accuracy on future data as a random variable (bear with me, this is the only slide where I'm a little bit more technical) and we take its expectation. If we assume, in the simple case, a uniform probability distribution over the proportion of positives, we end up with a different measure. This one is what we call the average recall: half of the true positive rate plus half of the true negative rate. What this all means is that if we choose accuracy, we're essentially assuming that the data we're using to evaluate the model has the same distribution as the data we're going to see in the future. That's a huge assumption to make. If we don't know anything at all about the future distribution, average recall is probably a better choice.

So let me give you an example. Let's imagine we have the same dataset and two different models. On the left-hand side you have this confusion matrix.
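
The expectation argument above can be sketched as a short derivation. It assumes that the true positive rate and true negative rate of the model do not change with the class distribution, and that the proportion of positives, pos, is uniform on [0, 1]; both are modelling assumptions, not facts about any particular dataset.

```latex
\begin{align*}
\mathrm{acc} &= \frac{TP + TN}{T}
              = \frac{P}{T}\cdot\frac{TP}{P} + \frac{N}{T}\cdot\frac{TN}{N}
              = \mathrm{pos}\cdot\mathrm{tpr} + (1 - \mathrm{pos})\cdot\mathrm{tnr} \\[4pt]
\mathbb{E}[\mathrm{acc}]
             &= \int_0^1 \bigl(\mathrm{pos}\cdot\mathrm{tpr}
                + (1 - \mathrm{pos})\cdot\mathrm{tnr}\bigr)\, d\,\mathrm{pos}
              = \tfrac{1}{2}\,\mathrm{tpr} + \tfrac{1}{2}\,\mathrm{tnr}
              = \text{avg-rec}
\end{align*}
```
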
So we have 100 samples in total, 80 positive and 20 negative. The first model, M1, is producing the results you see in the confusion matrix: the accuracy in the first case is 0.80 and the average recall is 0.88. On the right-hand side we have the second model, M2, on the same data, which is producing 85% accuracy, so compared to M1 it seems a little bit better. But if you look at the average recall, we have 0.72. So if we just look at the accuracy we might lean towards M2 on this data, but if we look carefully at the results, it actually seems that M1 is doing a better job in predicting the negative class. This is one of the main issues we have with accuracy: here it is mostly reflecting the positive class. The average recall helps us reach this point and better understand what's going on. So is accuracy a good idea? Probably not really, in this particular case.

Let's have a look at the F1 measure now. Recall that the F1 measure is simply the harmonic mean of precision and recall. On the left-hand side we have model M2 performing on D. In this case we're going to switch the dataset, not the model: we fix the model and change the data. Taking the results from this confusion matrix, the F1 is 0.91 and the accuracy is 0.85, the same results we got in the previous example. But if we move to a different dataset like this one, where instead of 100 samples we have 1,000 samples, and in particular lots of negatives that we didn't have in the previous example, the accuracy is going to go through the roof, to 0.99, whereas the F1 measure stays at 0.91. It's essentially the same thing.
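
The two comparisons above can be reproduced with a few lines of Python. The transcript only gives the class totals and the resulting scores, so the individual cell counts below are assumptions, chosen to be consistent with the quoted figures; the actual cells on the slides may differ.

```python
def summarize(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, average recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return {
        "accuracy": (tp + tn) / total,
        "avg_recall": (tpr + tnr) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Assumed counts: 80 positives and 20 negatives in both cases.
m1 = summarize(tp=60, fn=20, fp=0, tn=20)
m2 = summarize(tp=75, fn=5, fp=10, tn=10)

# Same assumed model behaviour on the positives, but 900 extra true negatives.
m2_many_negatives = summarize(tp=75, fn=5, fp=10, tn=910)

for name, result in [("M1", m1), ("M2", m2), ("M2, 1000 samples", m2_many_negatives)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

With these counts M2 wins on accuracy while M1 wins on average recall, and adding true negatives moves the accuracy of M2 but leaves its F1 unchanged, because TN never enters the F1 formula.
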
So what's the lesson we can derive from this one? F1 has to be preferred in domains where negatives abound and, most importantly, negatives are not the relevant class we want to classify. In the first example, accuracy was essentially taking a look at just the correct predictions, but it was not considering the negative class at all. In this particular example we have lots of negatives, but the negatives are not the interesting bits. In this particular classification problem, we want to do better on the positive ones. If we look at just accuracy, we would definitely go with M2 at 0.99. But we might say no, actually 0.91 is the right measure. And I just want to clarify what could be a case where negatives are not interesting. Consider that precision, recall, and F1 all come from information retrieval. In information retrieval you're interested in the right results returned by a search engine, not the negative ones. So typically that's a case in which negatives are not relevant at all: you're not interested in the wrong results, you're more interested in the results returned by your search engine query.

Okay, so what's the case in which this does not hold, where we do want to account for all the possible classes? Let's have a look at this example. Back again to a confusion matrix with, again, 100 samples. This is a very corner case, but it could be the case most of the time if we don't account for class distribution properly. Here the classifier M2 is doing a brilliant job in classifying the positives and a terrible job in classifying the negatives: we have no negatives predicted at all.
So when it comes to F1 we have 0.97, the accuracy is 0.95, and the MCC is undefined. This means that there's something wrong here: we can't actually assess any performance, because it's totally wrong. In this other case, on the other hand, we do slightly better on the negatives, but there's still confusion about what's going on. We are doing a good job on the positives and a decent job on the negatives, but we are still making some errors on the negatives, if you do the counting here. Still, the F1 and the accuracy are through the roof, so you might be tempted to say: okay, my model is actually doing pretty well. But the thing is, we're doing badly on the negative class, and the negative class counts in this particular case. The errors on the negatives must be weighed differently, because that's the least represented class. And the MCC is telling you: well, it's only 0.14, this is not going to go anywhere.

I should probably mention that the MCC, differently from the other metrics, doesn't go from zero to one; it goes from minus one to one. You have to read it this way: minus one means you're classifying the classes in a completely opposite way, so positives as negatives and the other way around; an MCC of zero means random prediction; and an MCC of one means perfect prediction. So the takeaway message here is that MCC is to be preferred in general when predictions on all classes count.

The takeaway lessons are essentially what I've shown already, but the main message I want to convey is: when you do an experiment, don't just record accuracy. Instead, keep track of the main primary metrics, and the secondary metrics can be derived afterwards.

Okay, so now the second question is how to measure things. In the first part we focused on the metrics side.
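
A small sketch of the two MCC cases discussed above. The cell counts are assumptions: the transcript does not give them, and they are picked so that the scores land close to the values quoted in the talk (F1 around 0.97 with an undefined MCC, then an MCC of about 0.14).

```python
import math
from typing import Optional


def mcc(tp: int, fn: int, fp: int, tn: int) -> Optional[float]:
    """Matthews correlation coefficient; None when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return None  # undefined, e.g. when one class is never predicted
    return (tp * tn - fp * fn) / denominator


def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    return (tp + tn) / (tp + fn + fp + tn)


def f1(tp: int, fn: int, fp: int, tn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)


# Assumed counts, 100 samples each.
never_predicts_negative = dict(tp=95, fn=0, fp=5, tn=0)
few_correct_negatives = dict(tp=90, fn=4, fp=5, tn=1)

for label, counts in [
    ("never predicts negative", never_predicts_negative),
    ("few correct negatives", few_correct_negatives),
]:
    print(label, "F1:", round(f1(**counts), 3),
          "accuracy:", round(accuracy(**counts), 3),
          "MCC:", mcc(**counts))
```

Returning `None` for the undefined case is a design choice of this sketch; some libraries report 0 instead, so it is worth checking how your tool handles it.
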
Now let's focus on how we prepare the data to do that. In evaluating supervised learning models, and machine learning models in general, the whole process might seem quite straightforward: you train the model, then you calculate how the model is doing on that same data, using the metric you chose. Well, actually, this is not the right way of doing it, because this would be an overly optimistic estimate, also known as the in-sample error. Your goal is not to evaluate the model on the data you have right now in the training set; you want to evaluate your model on data the model has never seen before or, in other words, data the model will see in the future. That's technically called the out-of-sample error.

So how do we do that? One simple and classical way is the holdout evaluation. You have your whole dataset and you split it into a partition; partition means there's no overlap between the samples in one set and the other set. You have the training set and the test set, also known as the holdout set. Important message here, and I'm going to repeat this quite often later today: the test set has to be put aside. The test set is something you generate at the very beginning of your pipeline. You start by saying: okay, this is my data, and I want to take away some part which I will be using later. The test set has to be taken out, and you have to forget about it. It is only taken back when you're doing the actual evaluation. From then on, once you have the training set and the test set, you do everything on the training set, and only the evaluation, the last step, on the test set. So again, the test set is only for performance evaluation, nothing else. To do that, we can simply split into train and test with a simple Python function.
It can be a utility function from scikit-learn, train_test_split, that does the job. And that's it. The problem is that this kind of train-test split works, but it's a rather weak way of evaluating performance, because it's highly dependent on how you select the samples in the test partition. And remember, we're considering this test partition as representative of future data. So we say: okay, this is the data on which we want to evaluate the model, and this is going to be a good test because the model in the future is going to see data which resembles what I'm using right now in the test partition.

So what can be a better way to do that? We can use a slightly more sophisticated partitioning process, which is called cross-validation. The idea is that we generate several test partitions and use them to assess the model. We have the dataset, and we randomly split the data into k partitions, also known as folds. Then we use these folds in turn: for k times, we fit the model on k minus one partitions, the blue ones you see, and the pink one is used as the evaluation partition, so the test or validation partition. We repeat this process on k different models. This is the important bit to remember: every time, you have a brand new model that you're training on a different partition. Then you average over the different folds: you have chosen the metric at the beginning, so you can average the value of the metric over the different partitions, and you have some indication of how the model is performing on different versions of the training and test partitions. Remember, the deal is always the same with the test data: the test folds must remain unseen by the model during training.
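
A minimal sketch of the holdout split and k-fold cross-validation described above, using scikit-learn. The synthetic data from `make_classification`, the logistic regression model, and the split sizes are placeholders chosen for illustration; they are not taken from the talk or its notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Placeholder data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Holdout: put the test set aside at the very beginning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation on the training portion only.
# cross_val_score clones the estimator, so every fold trains a brand new model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=cv, scoring="accuracy"
)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())

# Final step: fit on the full training set, evaluate once on the held-out test set.
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Accuracy is used here only to keep the sketch short; any of the metrics from the first part can be passed through the `scoring` argument.
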
So once you generate the test partition, you have to use it only at the end to evaluate the model. Whichever feature selection or processing you want to do, you have to do it on the training set and then apply it to the test set. k can be any number; the typical values in the literature are five or ten. If k is equal to n, the total number of samples, this is also known as leave-one-out cross-validation. Cross-validation can be repeated, which means you can change the seed and repeat the process multiple times, although you're increasingly violating the assumption that your samples are independent and, what's the other one? Identically, thank you very much: independent and identically distributed. And cross-validation can be stratified, which means you maintain the same class distribution among the training and test folds. That is useful when you have an imbalanced dataset, or if we expect the learning algorithm to be sensitive to the class distribution. In scikit-learn, in fact, you have a long list of different methods you might choose to do cross-validation: GroupKFold, ShuffleSplit, and many others. So you don't have to reinvent the wheel.

I want to clarify, in this last bit, that there's a common mistake: sometimes you want to use cross-validation to do model selection, also known as hyperparameter selection. This is methodologically wrong, as parameter tuning should be part of the training, so test data shouldn't be used at all in hyperparameter tuning. What you can do instead is this: you do your partition, so you generate your test set, and again you forget about it. Then you take the training set and you split it into training and validation. You do your hyperparameter tuning on this new training set, and you validate the hyperparameter values on the validation set.
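
The tuning procedure described here (test set aside, then a train/validation split, then a final retrain on the full training set and one evaluation on the test set) can be sketched as follows. The synthetic data, the logistic regression model, the candidate values for `C`, and the use of MCC as the selection metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data (roughly 80% / 20%).
X, y = make_classification(
    n_samples=300, n_features=5, weights=[0.8, 0.2], random_state=0
)

# 1. Generate the test set first, then forget about it.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Split the remaining data into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25,
    stratify=y_train_full, random_state=0,
)

# 3. Tune the hyperparameter using the validation set only.
candidates = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid for C
best_c, best_score = None, -2.0      # MCC is never below -1
for c in candidates:
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    score = matthews_corrcoef(y_val, model.predict(X_val))
    if score > best_score:
        best_c, best_score = c, score

# 4. Retrain on the full training set, then evaluate once on the test set.
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(
    X_train_full, y_train_full
)
test_mcc = matthews_corrcoef(y_test, final_model.predict(X_test))
print("best C:", best_c, "validation MCC:", best_score, "test MCC:", test_mcc)
```

The test set is touched exactly once, in step 4, which is what keeps the final estimate an out-of-sample one.
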
And once you're done, you retrain everything on the full training set you had at the beginning, and then you assess the model performance on the test set. That's a methodologically sound way of doing hyperparameter tuning on the actual data.

The last thing I want to mention here is the no free lunch theorem. It was proven a long time ago that for some data one model works better, while for some other data there are other models working better than the previous one. So essentially there doesn't exist a unique model which can rule them all, and you have to try multiple times. As Jesper was saying before, random forest is the baseline, and you don't have to shoot with the cannon of deep learning, for example, if just a linear model can do the job. That's why you sometimes need cross-validation: it provides you with a robust framework to make this choice. Once you have your experimental framework set up, you can essentially swap in whichever model you want, repeat the same process, and use cross-validation to see how these different models perform on different versions of the train and test, or train and validation, datasets.

So this is the link to the repository. In the interest of time (yes, I'm almost running out of time), I just want to show you the basics of what we've done in the repository. You can find different notebooks there, one for each of the steps Jesper mentioned. I'm going to focus very quickly on the first two, but mostly on this one, on model evaluation. In the notebooks we're going to use this dataset, which is a sort of toy dataset: the penguins dataset. It's a classification problem.
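
As a rough illustration of swapping models inside a fixed cross-validation framework, the sketch below compares a linear model with a random forest on the same folds. The data is synthetic and stands for the training portion only (the test set is assumed to be already put aside); the two models and the `balanced_accuracy` scorer, which corresponds to the average recall from the first part, are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder training data; a held-out test set is assumed to exist elsewhere.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The same folds are reused for every candidate, so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "linear model": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

mean_scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```
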
We have some features, and we want to classify the different classes. In the first notebook, on data preparation, we look at the data and do some analysis of the features. We can do that because we have a limited number of features. Afterwards we clean the data: we remove some NaNs from the dataset, which are not particularly useful. And then we start our preparation by doing some feature preprocessing. The only takeaway message from this part is: whenever you want to do any preprocessing, always do the train-test split first, and then apply the feature processing. As in, you don't preprocess your test set on its own; you apply the results of the processing learned on the training set to the test set. That's a different thing. And that's how we get our data ready for training.

In the evaluation part, we're going to see examples of data splitting and then cross-validation. Stratification is one important aspect of data splitting. When you have different class distributions and you're splitting into training and test, you want to keep the distribution of the classes in both the training and the test set. In other words, when you sample randomly while splitting, you're not sampling from your whole dataset but from the class buckets. This is better for the future stability of the predictions. And this is sort of the case here: we have one class which is less represented, so if we just randomly split the data we might end up not having any sample of it at all in the test set, which is not robust at all, because we want to evaluate on all the possible classes we have. But probably the most interesting bit for me is this model evaluation bit. Yes,
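
A minimal sketch of the two points above: split first (with stratification), then learn the preprocessing on the training set and only apply it to the test set. The synthetic data and the choice of `StandardScaler` are assumptions for illustration; the notebooks in the repository may use different data and preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder imbalanced data (roughly 90% / 10%).
X, y = make_classification(
    n_samples=200, n_features=4, weights=[0.9, 0.1], random_state=0
)

# Split first. stratify=y samples within each class, so both parts
# keep approximately the same class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Learn the scaling parameters on the training set only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ...and apply the already-fitted scaler to the test set (no refitting).
X_test_scaled = scaler.transform(X_test)

train_minority = np.mean(y_train == 1)
test_minority = np.mean(y_test == 1)
print("minority share in train:", train_minority)
print("minority share in test:", test_minority)
```

Fitting the scaler on the whole dataset before splitting would leak information about the test set into training, which is exactly what the split-first rule prevents.
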

Oh, brilliant. Okay, so I'll stop here and let's go with the questions. Or we can do them at the end, as you prefer. We could probably do them at the end. I will keep track of them, so I don't steal time to

That's totally fine. I just want to say that if you look at the notebook (we had a quick joke about it), since we're using a toy dataset, I had to find a way to break it to prove the point I was trying to make. In essence, in the section of this notebook on choosing the appropriate evaluation metric, what I did was choose two classes, the most represented and the least represented class, so we have just two classes, again binary classification to make it simple, not because there's any limitation in the metrics. Then I chose the features which were the most confusing ones. I looked at the different features we had and said: look, if we do classification with this combination of two features, the flipper length and the culmen depth, it might be quite hard for a model to figure out which of the two classes a sample belongs to. So what I did was essentially prove the point that, with this particular setting of the data, the same data just filtered a little bit, if you use cross-validation and evaluate with both the accuracy and the MCC, you end up with very different results. Accuracy, remember, just considers the good ones, not the bad ones, as the MCC does. If you look at the average accuracy in the cross-validation, you see 0.70, which seems reasonable. Actually it's definitely not, because we're making lots of errors on the negatives here. And this is something we should spot when evaluating our model. So yeah, that was the point, but you'll find the details in the notebooks. I'm more than happy to go through them afterwards on Discord later today. Not now. Let me have a look, any.
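
To give a feel for this kind of comparison, the sketch below tracks accuracy and MCC side by side during cross-validation. It does not use the penguins data or the notebook's code: the two-feature, imbalanced, poorly separated synthetic dataset and the logistic regression model are stand-ins assumed for illustration, so the numbers it prints will not match the 0.70 quoted in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate

# Stand-in data: two weakly separated features, imbalanced classes.
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.8, 0.2], class_sep=0.5, random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = {
    "accuracy": "accuracy",
    "mcc": make_scorer(matthews_corrcoef),
}

# cross_validate returns one array of per-fold scores for each metric.
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)
print("mean accuracy:", results["test_accuracy"].mean())
print("mean MCC:", results["test_mcc"].mean())
```

Recording both metrics in the same run makes it easy to spot cases where a reasonable-looking accuracy hides poor performance on the minority class.
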